Abstract

In query learning, the goal is to identify an unknown object while minimizing the number of yes or no 
questions (queries) posed about that object. We consider three extensions of this fundamental problem 
that are motivated by practical considerations in real-world, time-critical identification tasks such as 
emergency response. First, we consider the problem where the objects are partitioned into groups, and 
the goal is to identify only the group to which the object belongs. Second, we address the situation 
where the queries are partitioned into groups, and an algorithm may suggest a group of queries to a 
human user, who then selects the actual query. Third, we consider the problem of query learning in 
the presence of persistent query noise, and relate it to group identification. To address these problems 
we show that a standard algorithm for query learning, known as the splitting algorithm or generalized 
binary search, may be viewed as a generalization of Shannon-Fano coding. We then extend this result to
the group-based settings, leading to new algorithms. The performance of our algorithms is demonstrated 
on simulated data and on a database used by first responders for toxic chemical identification. 

1 Introduction 

In emergency response applications, as well as other time-critical diagnostic tasks, there is a need to rapidly 
identify a cause by selectively acquiring information from the environment. For example, in the problem 
of toxic chemical identification, a first responder may question victims of chemical exposure regarding the 
symptoms they experience. Chemicals that are inconsistent with the reported symptoms may then be 
eliminated. Because of the importance of this problem, several organizations have constructed extensive 
evidence-based databases (e.g., Haz-Map) that record toxic chemicals and the acute symptoms which they
are known to cause. Unfortunately, many symptoms tend to be nonspecific (e.g., vomiting can be caused 
by many different chemicals), and it is therefore critical for the first responder to pose these questions in 
a sequence that leads to chemical identification in as few questions as possible. 

This problem has been studied from a mathematical perspective for decades, and has been described 
variously as query learning (with membership queries) [1], active learning [2], object/entity identification [3]
[4], and binary testing [4] [5]. In this work we refer to the problem as query learning or object identification.
The standard mathematical formulation of query learning is often idealized relative to many real-world 
diagnostic tasks, in that it does not account for time constraints and resulting input errors. In this paper 
we investigate algorithms that extend query learning to such more realistic settings by addressing the need 
for rapid response, and error-tolerant algorithms. 

In query learning there is an unknown object $\theta$ belonging to a set $\Theta = \{\theta_1, \dots, \theta_M\}$ of $M$ objects
and a set $Q = \{q_1, \dots, q_N\}$ of $N$ distinct subsets of $\Theta$ known as queries. Additionally, the vector $\Pi =
(\pi_1, \dots, \pi_M)$ denotes the a priori probability distribution over $\Theta$. The goal is to determine the unknown
object $\theta \in \Theta$ through as few queries from $Q$ as possible, where a query $q \in Q$ returns a value 1 if $\theta \in q$, and
0 otherwise. A query learning algorithm thus corresponds to a decision tree, where the internal nodes are
queries, and the leaf nodes are objects. Problems of this nature arise in applications such as fault testing 
[6] [7], machine diagnostics [8], disease diagnosis [5] [9], computer vision [10] and active learning [2] [11].
Algorithms and performance guarantees have been extensively developed in the literature, as described in
Section 1.1 below.

In the context of toxic chemical identification, the objects are chemicals, and the queries are symptoms. 
A query learning algorithm will prompt the first responder with a symptom. Once the presence or absence 
of that symptom is determined, a new symptom is suggested by the algorithm, and so on, until the chemical 
is uniquely determined. In this paper, we consider variations on this basic query learning framework that 
are motivated by toxic chemical identification, and are naturally applicable to many other time-critical 
diagnostic tasks. In particular, we develop theoretical results and new algorithms for what might be 
described as group-based query learning. 

First, we consider the case where $\Theta$ is partitioned into groups of objects, and it is only necessary
to identify the group to which the object belongs. For example, the appropriate response to a toxic 
chemical may only depend on the class of chemicals to which it belongs (pesticide, corrosive acid, etc.). As 
our experiments reveal, a query learning algorithm designed to rapidly identify individual objects is not 
necessarily efficient for group identification. 

Second, we consider the problem where the set Q of queries is partitioned into groups (respiratory 
symptoms, cardio symptoms, etc.). Instead of suggesting specific symptoms to the user, we design an 
algorithm that suggests a group of queries, and allows the user the freedom to input information on any 
query in that group. Although such a system will theoretically be less efficient, it is motivated by the fact 
that in a practical application, some symptoms will be easier for a given user to understand and identify. 
Instead of suggesting a single symptom, which might seem out of the blue to the user, suggesting a query 
group will be less bewildering, and hence lead to a more efficient and accurate outcome. Our experiments 
demonstrate that the proposed algorithm based on query groups identifies objects in nearly as few queries 
as a fully active method. 

Third, we apply our algorithm for group identification to the problem of query learning with persistent 
query noise. Persistent query noise occurs when the response to a query is in error and the query cannot be
resampled, contrary to what is often assumed in the literature. Such is the case when the presence or absence of a symptom is
incorrectly determined, which is more likely in a stressful emergency response scenario. Experiments show 
our method offers significant gains over algorithms not designed for persistent query noise. 

Our algorithms are derived in a common framework, and are based on a reinterpretation of a standard 
query learning algorithm (the splitting algorithm, or generalized binary search) as a generalized form of 
Shannon-Fano coding. We first establish an exact formula for the expected number of queries by an 
arbitrary decision tree, and show that the splitting algorithm effectively performs a greedy, top-down 
optimization of this objective. We then extend this formula to the case of group identification and query 
groups, and develop analogous greedy algorithms. In the process, we provide a new interpretation of 
impurity-based decision tree induction for multi-class classification. 

We apply our algorithms to both synthetic data and to the WISER database (version 4.21). WISER,
which stands for Wireless Information System for Emergency Responders, is a decision support system
developed by the National Library of Medicine (NLM) for first responders. This database describes the
binary relationship between 298 toxic chemicals (the number of distinguishable chemicals in
this database) and 79 acute symptoms. The symptoms are grouped into 10 categories (e.g., neurological,
cardio) as determined by NLM, and the chemicals are grouped into 16 categories (e.g., pesticides, corrosive
acids) as determined by a toxicologist and a Hazmat expert.

1.1 Prior and related work



The problem of selecting an optimal sequence of queries from $Q$ to uniquely identify the unknown object $\theta$ is
equivalent to determining an optimal binary decision tree, where each internal node in the tree corresponds
to a query, each leaf node corresponds to a unique object from the set $\Theta$, and the optimality is with respect
to minimizing the expected depth of the leaf node corresponding to $\theta$. In the special case when the query
set $Q$ is complete (where a query set $Q$ is said to be complete if for any $S \subseteq \Theta$ there exists a query
$q \in Q$ such that either $q = S$ or $\Theta \setminus q = S$), the problem of constructing an optimal binary decision
tree is equivalent to construction of optimal variable-length binary prefix codes with minimum expected
length. This problem has been widely studied in information theory with both Shannon [12] and Fano
[13] independently proposing a top-down greedy strategy to construct suboptimal binary prefix codes,
popularly known as Shannon-Fano codes. Later Huffman [14] derived a simple bottom-up algorithm to
construct optimal binary prefix codes. A well known lower bound on the expected length of binary prefix
codes is given by the Shannon entropy of the probability distribution $\Pi$ [15].

When the query set $Q$ is not complete, a query learning problem can be considered as "constrained"
prefix coding with the same lower bound on the expected depth of a tree. This problem has also been
studied extensively in the literature with Garey [3] [4] proposing a dynamic programming based algorithm
to find an optimal solution. This algorithm runs in exponential time in the worst case. Later, Hyafil
and Rivest [16] showed that determining an optimal binary decision tree for this problem is NP-complete.
Thereafter, various greedy algorithms [5] [17] [18] have been proposed to obtain a suboptimal binary decision
tree. The most widely studied algorithm, known as the splitting algorithm [5] or generalized binary search
(GBS) [2] [11], selects a query that most evenly divides the probability mass of the remaining objects
[2] [5] [11] [19]. Various bounds on the performance of this greedy algorithm have been established in
[2] [5] [11]. Goodman and Smyth [19] observe that this algorithm can be viewed as a generalized version of
Shannon-Fano coding. In Section 2 we demonstrate the same through an alternative approach that can
be generalized to group-based query learning problems, leading to efficient algorithms in these settings. As
far as we know, there has been no previous work on group queries or group identification.

Though most of the above work has been devoted to query learning in the ideal setting assuming no 
noise, it is unrealistic to assume that the responses to queries are without error in many applications. 
The problem of learning in the presence of query noise has been studied in [11] [20] [21] where the queries
can be resampled or repeated. However, in certain applications, resampling or repeating the query does
not change the query response, confining the algorithm to non-repeatable queries. The work by Rényi in
[22] is regarded to be the first to consider this more stringent noise model, also referred to as persistent
noise in the literature [23] [24] [25]. However, his work has focused on the passive setting where the queries
are chosen at random. Learning under the persistent noise model has also been studied in [23] [24] [26] where
the goal was to identify or learn Disjunctive Normal Form (DNF) formulae from noisy data. The query
(label) complexity of pool-based active learning in the Probably Approximately Correct (PAC) model in
the presence of persistent classification noise has been studied in [25] and active learning algorithms in this
setting have been proposed in [25] [27]. Here, we focus on the problem of query learning under the persistent
noise model where the goal is to uniquely identify the true object. Finally, this work was motivated by 
earlier work that applied GBS to WISER [28]. 

1.2 Notation 

We denote a query learning problem by a pair $(B, \Pi)$ where $B$ is a binary matrix with $b_{ij}$ equal to 1 if $\theta_i \in q_j$,
and 0 otherwise. A decision tree $T$ constructed on $(B, \Pi)$ has a query from the set $Q$ at each of its internal
nodes with the leaf nodes terminating in the objects from the set $\Theta$. At each internal node in the tree,
the object set under consideration is divided into two subsets, corresponding to the objects that respond
0 and 1 to the query, respectively. For a decision tree with $L$ leaves, the leaf nodes are indexed by the set
$\mathcal{L} = \{1, \dots, L\}$ and the internal nodes are indexed by the set $\mathcal{I} = \{L + 1, \dots, 2L - 1\}$. At any internal
node $a \in \mathcal{I}$, let $l(a), r(a)$ denote the "left" and "right" child nodes, where the set $\Theta_a \subseteq \Theta$ corresponds
to the set of objects that reach node 'a', and the sets $\Theta_{l(a)} \subseteq \Theta_a$, $\Theta_{r(a)} \subseteq \Theta_a$ correspond to the sets of
objects that respond 0 and 1 to the query at node 'a', respectively. We denote by $\pi_{\Theta_a} := \sum_{\{i : \theta_i \in \Theta_a\}} \pi_i$
the probability mass of the objects under consideration at any node 'a' in the tree. Also, at any node 'a',
the set $Q_a \subseteq Q$ corresponds to the set of queries that have been performed along the path from the root
node up to node 'a'.

We denote the Shannon entropy of a vector $\Pi = (\pi_1, \dots, \pi_M)$ by $H(\Pi) := -\sum_i \pi_i \log_2 \pi_i$ and the
Shannon entropy of a proportion $\pi \in [0, 1]$ by $H(\pi) := -\pi \log_2 \pi - (1 - \pi) \log_2 (1 - \pi)$, where we use the
limit $\lim_{\pi \to 0} \pi \log_2 \pi = 0$ to define the limiting cases. Finally, we use the random variable $K$ to denote the
number of queries required to identify an unknown object $\theta$ or the group of an unknown object using a
given tree.
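As a minimal sketch of the two entropy definitions above, the following Python functions compute them in bits with the convention $0 \log_2 0 = 0$. The names `shannon_entropy` and `binary_entropy` are illustrative choices and do not come from the paper.

```python
import math

def shannon_entropy(probs):
    """H(Pi) = -sum_i pi_i log2 pi_i, skipping zero entries (0 * log2 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """H(pi) for a single proportion pi in [0, 1]."""
    return shannon_entropy([p, 1.0 - p])

# Uniform prior on four objects, and a perfectly balanced proportion.
print(shannon_entropy([0.25] * 4))
print(binary_entropy(0.5))
```

Both printed values follow directly from the definitions: a uniform prior on four objects has two bits of entropy, and a proportion of 0.5 has one bit.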



2 Generalized Shannon-Fano Coding 

Before proceeding to group-based query learning, we first present an exact formula for standard query 
learning problems. This result allows us to interpret the splitting algorithm or GBS as generalized Shannon- 
Fano coding. Furthermore, our proposed algorithms for group-based settings are based on generalizations 
of this result. 

First, we define a parameter called the reduction factor on the binary matrix/tree combination that 
provides a useful quantification on the expected number of queries required to identify an unknown object. 

Definition 1. A reduction factor at any internal node 'a' in a decision tree is defined as
$\rho_a = \max(\pi_{\Theta_{l(a)}}, \pi_{\Theta_{r(a)}}) / \pi_{\Theta_a}$ and the overall reduction factor of a tree is defined as $\rho = \max_{a \in \mathcal{I}} \rho_a$.

Note from the above definition that $0.5 \le \rho_a \le \rho \le 1$ and we describe a decision tree with $\rho = 0.5$ to
be a perfectly balanced tree.
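The reduction factor of a candidate query at a node can be sketched as follows. The representation is an assumption made for illustration: `B` is a list of rows (one per object, one column per query), `prior` holds the $\pi_i$, and `active` lists the indices of the objects in $\Theta_a$. The small matrix is hypothetical and is not one of the paper's toy examples.

```python
def reduction_factor(B, prior, active, j):
    """rho_a for query j: larger child mass divided by the node mass."""
    node_mass = sum(prior[i] for i in active)
    mass_one = sum(prior[i] for i in active if B[i][j] == 1)
    return max(mass_one, node_mass - mass_one) / node_mass

# Hypothetical 4-object, 3-query problem with a uniform prior.
B = [[1, 0, 1],
     [1, 1, 0],
     [0, 1, 0],
     [0, 0, 0]]
prior = [0.25, 0.25, 0.25, 0.25]
root = [0, 1, 2, 3]

print(reduction_factor(B, prior, root, 0))  # balanced 2-vs-2 split
print(reduction_factor(B, prior, root, 2))  # unbalanced 1-vs-3 split
```

Query 0 splits the objects evenly and attains the minimum value 0.5, while query 2 isolates a single object and gives 0.75.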

Given a query learning problem $(B, \Pi)$, let $\mathcal{T}(B, \Pi)$ denote the set of decision trees that can uniquely
identify all the objects in the set $\Theta$. For any decision tree $T \in \mathcal{T}(B, \Pi)$, let $\{\rho_a\}_{a \in \mathcal{I}}$ denote the set of
reduction factors and let $d_i$ denote the depth of object $\theta_i$ in the tree. Then the expected number of queries
required to identify an unknown object using the given tree is equal to

$$E[K] = \sum_{i=1}^{M} \Pr(\theta = \theta_i)\, E[K \mid \theta = \theta_i] = \sum_{i=1}^{M} \pi_i d_i.$$

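Theorem 1. The expected number of queries required to identify an unknown object using a tree $T$ with
reduction factors $\{\rho_a\}_{a \in \mathcal{I}}$ constructed on $(B, \Pi)$ is given by

$$E[K] = H(\Pi) + \sum_{a \in \mathcal{I}} \pi_{\Theta_a} \left[ 1 - H(\rho_a) \right] = \frac{H(\Pi)}{\sum_{a \in \mathcal{I}} \tilde{\pi}_{\Theta_a} H(\rho_a)} \tag{1}$$

where $\tilde{\pi}_{\Theta_a} := \pi_{\Theta_a} / \sum_{a \in \mathcal{I}} \pi_{\Theta_a}$.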



Proof. The first equality is a special case of Theorem 2 below. The second equality follows from the
observation $E[K] = \sum_{i=1}^{M} \pi_i d_i = \sum_{a \in \mathcal{I}} \pi_{\Theta_a}$. Hence replacing $\pi_{\Theta_a}$ with $\tilde{\pi}_{\Theta_a} \cdot E[K]$ in the first equality leads
to the result. □
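The identities in Theorem 1 can be checked numerically on a small instance. The sketch below is illustrative only: it builds an arbitrary (non-greedy) tree by always taking the first query that splits the current node, and the matrix, prior, and function names are hypothetical. It returns the three quantities $\sum_i \pi_i d_i$, $\sum_a \pi_{\Theta_a}$, and $\sum_a \pi_{\Theta_a}[1 - H(\rho_a)]$.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with 0 * log2(0) taken as 0."""
    return -sum(x * math.log2(x) for x in (p, 1.0 - p) if x > 0)

def tree_terms(B, prior, active, depth=0):
    """Return (sum_i pi_i d_i, sum_a pi_a, sum_a pi_a [1 - H(rho_a)])
    for the subtree that identifies the objects in `active`."""
    if len(active) == 1:
        return prior[active[0]] * depth, 0.0, 0.0
    for j in range(len(B[0])):
        ones = [i for i in active if B[i][j] == 1]
        zeros = [i for i in active if B[i][j] == 0]
        if ones and zeros:  # first query that actually splits the node
            break
    else:
        raise ValueError("objects are not distinguishable by the queries")
    mass = sum(prior[i] for i in active)
    rho = max(sum(prior[i] for i in ones), sum(prior[i] for i in zeros)) / mass
    d0, m0, c0 = tree_terms(B, prior, zeros, depth + 1)
    d1, m1, c1 = tree_terms(B, prior, ones, depth + 1)
    return d0 + d1, mass + m0 + m1, mass * (1.0 - binary_entropy(rho)) + c0 + c1

# Hypothetical problem: each query singles out one object.
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
prior = [0.4, 0.3, 0.2, 0.1]
expected_depth, node_mass, correction = tree_terms(B, prior, [0, 1, 2, 3])
entropy = -sum(p * math.log2(p) for p in prior)

# The three printed values agree up to floating-point rounding.
print(expected_depth, node_mass, entropy + correction)
```

In this instance the leaf depths are 1, 2, 3, and 3, so the expected depth is 1.9, which matches both the sum of internal-node masses and the entropy plus the correction term.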

In the second equality, the term $\sum_{a \in \mathcal{I}} \tilde{\pi}_{\Theta_a} H(\rho_a)$ denotes the average entropy of the reduction factors,
weighted by the proportion of times each internal node 'a' is queried in the tree. This theorem re-iterates
an earlier observation that the expected number of queries required to identify an unknown object using
a tree constructed on $(B, \Pi)$ (where the query set $Q$ is not necessarily a complete set) is bounded below
by its entropy $H(\Pi)$. It also follows from the above result that a tree attains this minimum value (i.e.,
$E[K] = H(\Pi)$) iff it is perfectly balanced, i.e., the overall reduction factor $\rho$ of the tree is equal to 0.5.

From the first equality, the problem of finding a decision tree with minimum $E[K]$ can be formulated
as the following optimization problem

$$\min_{T \in \mathcal{T}(B, \Pi)} H(\Pi) + \sum_{a \in \mathcal{I}} \pi_{\Theta_a} \left[ 1 - H(\rho_a) \right] \tag{2}$$

Since $\Pi$ is fixed, the optimization problem reduces to minimizing $\sum_{a \in \mathcal{I}} \pi_{\Theta_a} [1 - H(\rho_a)]$ over the set of
trees $\mathcal{T}(B, \Pi)$. Note that the reduction factor $\rho_a$ depends on the query chosen at node 'a' in a tree $T$. As
mentioned earlier, finding a global optimal solution for this optimization problem is NP-complete.

Instead, we may take a top-down approach and minimize the objective function by minimizing the
term $\pi_{\Theta_a} [1 - H(\rho_a)]$ at each internal node, starting from the root node. Since $\pi_{\Theta_a}$ is independent of the
query chosen at node 'a', this reduces to minimizing $\rho_a$ (i.e., choosing a split as balanced as possible) at
each internal node $a \in \mathcal{I}$. The algorithm can be summarized as shown in Algorithm 1 below.



Generalized Binary Search (GBS)

Initialization: Let the leaf set consist of the root node
while some leaf node 'a' has $|\Theta_a| > 1$ do
    for each query $q \in Q \setminus Q_a$ do
        Find $\Theta_{l(a)}$ and $\Theta_{r(a)}$ produced by making a split with query $q$
        Compute the reduction factor $\rho_a$ produced by query $q$
    end
    Choose a query with the smallest reduction factor
    Form child nodes $l(a), r(a)$
end

Algorithm 1: Greedy decision tree algorithm for object identification
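Algorithm 1 can be sketched in Python as a recursive tree construction. This is an illustrative reading of the pseudocode, not the authors' implementation: the nested-dictionary tree format, the function names `gbs_tree` and `identify`, the tie-breaking rule (first query with the smallest reduction factor), and the choice to skip queries that do not split the node are all assumptions. The example matrix is hypothetical.

```python
def gbs_tree(B, prior, active=None, asked=frozenset()):
    """Greedy GBS tree. A leaf is an object index; an internal node is
    {"query": j, 0: subtree, 1: subtree}."""
    if active is None:
        active = list(range(len(B)))
    if len(active) == 1:
        return active[0]
    mass = sum(prior[i] for i in active)
    best_j, best_rho = None, None
    for j in range(len(B[0])):
        if j in asked:
            continue
        ones = [i for i in active if B[i][j] == 1]
        if not ones or len(ones) == len(active):
            continue  # assumption: ignore queries that do not split the node
        mass_one = sum(prior[i] for i in ones)
        rho = max(mass_one, mass - mass_one) / mass
        if best_rho is None or rho < best_rho:
            best_j, best_rho = j, rho
    if best_j is None:
        raise ValueError("remaining objects cannot be distinguished")
    zeros = [i for i in active if B[i][best_j] == 0]
    ones = [i for i in active if B[i][best_j] == 1]
    asked = asked | {best_j}
    return {"query": best_j,
            0: gbs_tree(B, prior, zeros, asked),
            1: gbs_tree(B, prior, ones, asked)}

def identify(tree, answer):
    """Walk the tree, calling answer(j) in {0, 1} for each query j asked."""
    while isinstance(tree, dict):
        tree = tree[answer(tree["query"])]
    return tree

# Hypothetical 4-object, 3-query problem with a uniform prior.
B = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
prior = [0.25] * 4
tree = gbs_tree(B, prior)
print(identify(tree, lambda j: B[2][j]))  # recovers object index 2
```

In a diagnostic setting, `answer` would be supplied by the user; here it simply reads the responses of a known object from `B`.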

Note that when the query set $Q$ is complete, Algorithm 1 is similar to Shannon-Fano coding [12], [13].
The only difference is that in Shannon-Fano coding, for computational reasons, the queries are restricted
to those that are based on thresholding the prior probabilities $\pi_i$.

Corollary 1. The standard splitting algorithm/GBS is a greedy algorithm to minimize the expected number
of queries required to uniquely identify an object.

Corollary 2 below follows from Theorem 1. It states that given a tree $T$ with overall reduction factor
$\rho < 1$, the average complexity of identifying an unknown object using this tree is $O(\log_2 M)$. Recently,
Nowak [11] showed there are geometric conditions (incoherence and neighborliness) that also bound the
worst-case depth of the tree to be $O(\log_2 M)$, assuming a uniform prior on objects. The conditions imply
that the reduction factors are close to $\frac{1}{2}$ except possibly near the very bottom of the tree where they could
be close to 1.

Corollary 2. The expected number of queries required to identify an unknown object using a tree with
overall reduction factor $\rho$ constructed on $(B, \Pi)$ is bounded above by

$$E[K] \le \frac{H(\Pi)}{H(\rho)} \le \frac{\log_2 M}{H(\rho)}.$$

Proof. Using the second equality in Theorem 1, we get

$$E[K] = \frac{H(\Pi)}{\sum_{a \in \mathcal{I}} \tilde{\pi}_{\Theta_a} H(\rho_a)} \le \frac{H(\Pi)}{H(\rho)} \le \frac{\log_2 M}{H(\rho)}$$

where the first inequality follows from the definition of $\rho$, $\rho \ge \rho_a \ge 0.5$, $\forall a \in \mathcal{I}$ (so that $H(\rho_a) \ge H(\rho)$,
since the binary entropy is decreasing on $[0.5, 1]$), and the last inequality
follows from the concavity of the entropy function, which gives $H(\Pi) \le \log_2 M$. □

Figure 1: Toy Example 1

Figure 2: Decision tree constructed using GBS for group identification on toy example 1


In the sections that follow, we show how Theorem 1 and Algorithm 1 may be generalized, leading to
principled strategies for group identification, query learning with group queries and query learning with 
persistent noise. 

3 Group Identification 

We now move to the problem of group identification, where the goal is not to determine the object, but only 
the group to which the object belongs. Here, in addition to the binary matrix B and a priori probability 
distribution $\Pi$ on the objects, the group labels for the objects are also provided, where the groups are
assumed to be disjoint. 

We denote a query learning problem for group identification by $(B, \Pi, y)$, where $y = (y_1, \dots, y_M)$
denotes the group labels of the objects, $y_i \in \{1, \dots, m\}$. Let $\{\Theta^i\}_{i=1}^{m}$ be a partition of the object set $\Theta$,
where $\Theta^i$ denotes the set of objects in $\Theta$ that belong to group $i$. It is important to note here that the
group identification problem cannot be simply reduced to a standard query learning problem with groups
$\{\Theta^1, \dots, \Theta^m\}$ as meta "objects," since the objects within a group need not respond the same to each
query. For example, consider the toy example shown in Figure 1 where the objects $\theta_1$, $\theta_2$ and $\theta_3$ belonging
to group 1 cannot be considered as one single meta object as these objects respond differently to queries
$q_1$ and $q_3$.

In this context, we also note that GBS can fail to find a good solution for a group identification problem 
as it does not take the group labels into consideration while choosing queries. Once again, consider the 
toy example shown in Figure 1 where just one query (query $q_2$) is sufficient to identify the group of an
unknown object, whereas GBS requires 2 queries to identify the group when the unknown object is either
$\theta_2$ or $\theta_4$, as shown in Figure 2. Hence, we develop a new strategy which accounts for the group labels when
choosing the best query at each stage. 

Note that when constructing a tree for group identification, a greedy, top-down algorithm terminates 
splitting when all the objects at the node belong to the same group. Hence, a tree constructed in this 
fashion can have multiple objects ending in the same leaf node and multiple leaves ending in the same 
group. 

For a tree with $L$ leaves, we denote by $\mathcal{L}^i \subset \mathcal{L} = \{1, \dots, L\}$ the set of leaves that terminate in group $i$.
Similar to $\Theta^i \subseteq \Theta$, we denote by $\Theta_a^i \subseteq \Theta_a$ the set of objects that belong to group $i$ at any internal node
$a \in \mathcal{I}$ in the tree. Also, in addition to the reduction factors defined in Section 2, we define a new set of
reduction factors called the group reduction factors at each internal node.

Definition 2. The group reduction factor of group $i$ at any internal node 'a' in a decision tree is defined
as $\rho_a^i = \max(\pi_{\Theta_{l(a)}^i}, \pi_{\Theta_{r(a)}^i}) / \pi_{\Theta_a^i}$.

Given $(B, \Pi, y)$, let $\mathcal{T}(B, \Pi, y)$ denote the set of decision trees that can uniquely identify the groups
of all objects in the set $\Theta$. For any decision tree $T \in \mathcal{T}(B, \Pi, y)$, let $\rho_a$ denote the reduction factor and
let $\{\rho_a^i\}_{i=1}^{m}$ denote the set of group reduction factors at each of its internal nodes. Also, let $d_j$ denote the
depth of leaf node $j \in \mathcal{L}$ in the tree. Then the expected number of queries required to identify the group
of an unknown object using the given tree is equal to

$$E[K] = \sum_{i=1}^{m} \Pr(\theta \in \Theta^i)\, E[K \mid \theta \in \Theta^i] = \sum_{i=1}^{m} \pi_{\Theta^i} \left[ \sum_{j \in \mathcal{L}^i} \frac{\pi_{\Theta_j}}{\pi_{\Theta^i}} d_j \right].$$



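Theorem 2. The expected number of queries required to identify the group of an object using a tree $T$ with
reduction factors $\{\rho_a\}_{a \in \mathcal{I}}$ and group reduction factors $\{\rho_a^i\}_{i=1}^{m}, \forall a \in \mathcal{I}$, constructed on $(B, \Pi, y)$, is given by

$$E[K] = H(\Pi_y) + \sum_{a \in \mathcal{I}} \pi_{\Theta_a} \left[ 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i) \right] \tag{3}$$

where $\Pi_y$ denotes the probability distribution of the object groups induced by the labels $y$, i.e., $\Pi_y =
(\pi_{\Theta^1}, \dots, \pi_{\Theta^m})$.

Proof. Special case of Theorem 6 below. □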

The above theorem states that given a query learning problem for group identification $(B, \Pi, y)$, the
expected number of queries required to identify the group of an unknown object is lower bounded by the
entropy of the probability distribution of the groups. It also follows from the above result that this lower
bound is achieved iff there exists a perfectly balanced tree (i.e., $\rho = 0.5$) with the group reduction factors
equal to 1 at every internal node in the tree. Also, note that Theorem 1 is a special case of this theorem
where each group has size 1, leading to $\rho_a^i = 1$ for all groups at every internal node.

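Using Theorem 2, the problem of finding a decision tree with minimum $E[K]$ can be formulated as the
following optimization problem

$$\min_{T \in \mathcal{T}(B, \Pi, y)} \sum_{a \in \mathcal{I}} \pi_{\Theta_a} \left[ 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i) \right] \tag{4}$$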



Note that here both the reduction factor $\rho_a$ and the group reduction factors $\{\rho_a^i\}_{i=1}^{m}$ depend on the query
chosen at node 'a'. Also, the above optimization problem, being a generalized version of the optimization
problem in (2), is NP-complete. Hence, we propose a suboptimal approach to solve the above optimization
problem where we optimize the objective function locally instead of globally. We take a top-down approach
and minimize the objective function by minimizing the term $C_a := 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i)$ at each
internal node, starting from the root node. The algorithm can be summarized as shown in Algorithm 2
below. This algorithm is referred to as GISA (Group Identification Splitting Algorithm) in the rest of this
paper.

Note that the objective function in this algorithm consists of two terms. The first term $[1 - H(\rho_a)]$
favors queries that evenly distribute the probability mass of the objects at node 'a' to its child nodes
(regardless of the group) while the second term $\sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i)$ favors queries that transfer an entire group
of objects to one of its child nodes.



3.1 Connection to Impurity-based Decision Tree Induction 

As a brief digression, in this section we show a connection between the above algorithm and impurity-based 
decision tree induction. In particular, we show that the above algorithm is equivalent to the decision tree 
splitting algorithm used in the C4.5 software package [29]. Before establishing this result, we briefly review
the multi-class classification setting where impurity-based decision tree induction is popularly used.

Group Identification Splitting Algorithm (GISA)

Initialization: Let the leaf set consist of the root node
while some leaf node 'a' has more than one group of objects do
    for each query $q_j \in Q \setminus Q_a$ do
        Compute $\{\rho_a^i\}_{i=1}^{m}$ and $\rho_a$ produced by making a split with query $q_j$
        Compute the cost $C_a(j)$ of making a split with query $q_j$
    end
    Choose a query with the least cost $C_a$ at node 'a'
    Form child nodes $l(a), r(a)$
end

Algorithm 2: Greedy decision tree algorithm for group identification
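A minimal Python sketch of Algorithm 2 follows, assuming strictly positive priors. The function names `gisa_cost` and `gisa_tree`, the nested-dictionary tree format, and the restriction to queries that actually split the node (added here so that the recursion terminates) are assumptions for illustration. The two-query instance is hypothetical and is not the paper's Figure 1: both queries are perfectly balanced, so the reduction factor $\rho_a$ alone cannot separate them, while the cost $C_a$ prefers the query aligned with the group boundary.

```python
import math

def binary_entropy(p):
    """H(p) in bits, with 0 * log2(0) taken as 0."""
    return -sum(x * math.log2(x) for x in (p, 1.0 - p) if x > 0)

def gisa_cost(B, prior, labels, active, j):
    """C_a(j) = 1 - H(rho_a) + sum_i (pi_a^i / pi_a) H(rho_a^i) for query j."""
    mass = sum(prior[i] for i in active)
    ones = sum(prior[i] for i in active if B[i][j] == 1)
    cost = 1.0 - binary_entropy(max(ones, mass - ones) / mass)
    for g in set(labels[i] for i in active):
        g_mass = sum(prior[i] for i in active if labels[i] == g)
        g_ones = sum(prior[i] for i in active if labels[i] == g and B[i][j] == 1)
        cost += (g_mass / mass) * binary_entropy(max(g_ones, g_mass - g_ones) / g_mass)
    return cost

def gisa_tree(B, prior, labels, active=None, asked=frozenset()):
    """A leaf is a group label; an internal node is {"query": j, 0: ..., 1: ...}."""
    if active is None:
        active = list(range(len(B)))
    groups = {labels[i] for i in active}
    if len(groups) == 1:
        return groups.pop()  # stop as soon as one group remains
    candidates = [j for j in range(len(B[0]))
                  if j not in asked
                  and 0 < sum(B[i][j] for i in active) < len(active)]
    if not candidates:
        raise ValueError("groups cannot be separated by the remaining queries")
    best = min(candidates, key=lambda j: gisa_cost(B, prior, labels, active, j))
    asked = asked | {best}
    zeros = [i for i in active if B[i][best] == 0]
    ones = [i for i in active if B[i][best] == 1]
    return {"query": best,
            0: gisa_tree(B, prior, labels, zeros, asked),
            1: gisa_tree(B, prior, labels, ones, asked)}

# Hypothetical instance: query 1 coincides with the group boundary.
B = [[1, 1], [0, 1], [1, 0], [0, 0]]
prior = [0.25] * 4
labels = [1, 1, 2, 2]
print(gisa_cost(B, prior, labels, [0, 1, 2, 3], 0))
print(gisa_cost(B, prior, labels, [0, 1, 2, 3], 1))
print(gisa_tree(B, prior, labels))
```

Query 0 splits each group in half and pays the full group-entropy penalty, whereas query 1 sends each group intact to one child, so the resulting tree identifies the group with a single query.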



In the multi-class classification setting, the input is training data $x_1, \dots, x_M$ sampled from some input
space (with an underlying probability distribution) along with their class labels, $y_1, \dots, y_M$, and the task is
to construct a classifier with the least probability of misclassification. Decision tree classifiers are grown by
maximizing an impurity-based objective function at every internal node to select the best classifier from a
set of base classifiers. These base classifiers can vary from simple axis-orthogonal splits to more complex
non-linear classifiers. The impurity-based objective function is

$$I(\Theta_a) - \left[ \frac{\pi_{\Theta_{l(a)}}}{\pi_{\Theta_a}} I(\Theta_{l(a)}) + \frac{\pi_{\Theta_{r(a)}}}{\pi_{\Theta_a}} I(\Theta_{r(a)}) \right] \tag{5}$$



which represents the decrease in impurity resulting from split 'a'. Here $I(\Theta_a)$ corresponds to the measure
of impurity in the input subspace at node 'a' and $\pi_{\Theta_a}$ corresponds to the probability measure of the input
subspace at node 'a'.

Among the various impurity functions suggested in literature [30] [31], the entropy measure used in the
C4.5 software package is popular. In the multi-class classification setting with $m$ different class labels,
this measure is given by

$$I(\Theta_a) = -\sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} \log_2 \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} \tag{6}$$

where $\pi_{\Theta_a}$, $\pi_{\Theta_a^i}$ are empirical probabilities based on the training data.

Similar to a query learning problem for group identification, the input here is a binary matrix $B$ with $b_{ij}$
denoting the binary label produced by base classifier $j$ on training sample $i$, and a probability distribution
$\Pi$ on the training data along with their class labels $y$. But unlike in a query learning problem where the
nodes in a tree are not terminated until all the objects belong to the same group, the leaf nodes here are
allowed to contain some impurity in order to avoid overfitting. The following result extends Theorem 2 to
the case of impure leaf nodes.

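Theorem 3. The expected depth of a leaf node in a decision tree classifier $T$ with reduction factors $\{\rho_a\}_{a \in \mathcal{I}}$
and class reduction factors $\{\rho_a^i\}_{i=1}^{m}, \forall a \in \mathcal{I}$, constructed on a multi-class classification problem $(B, \Pi, y)$, is
given by

$$E[K] = H(\Pi_y) + \sum_{a \in \mathcal{I}} \pi_{\Theta_a} \left[ 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i) \right] - \sum_{a \in \mathcal{L}} \pi_{\Theta_a} I(\Theta_a) \tag{7}$$

where $\Pi_y$ denotes the probability distribution of the classes induced by the class labels $y$, i.e., $\Pi_y =
(\pi_{\Theta^1}, \dots, \pi_{\Theta^m})$, and $I(\Theta_a)$ denotes the impurity in leaf node 'a' given by (6).

Proof. The proof is given in Appendix I. □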



The only difference compared to Theorem 2 is the last term, which corresponds to the average impurity
in the leaf nodes.

Theorem 4. At every internal node in a tree, minimizing the objective function $C_a := 1 - H(\rho_a) +
\sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i)$ is equivalent to maximizing $I(\Theta_a) - \left[ \frac{\pi_{\Theta_{l(a)}}}{\pi_{\Theta_a}} I(\Theta_{l(a)}) + \frac{\pi_{\Theta_{r(a)}}}{\pi_{\Theta_a}} I(\Theta_{r(a)}) \right]$
with entropy measure as the impurity function.

Proof. The proof is given in Appendix I. □
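The equivalence in Theorem 4 can be illustrated numerically. With the base-2 entropy impurity, the decrease in impurity is the mutual information between the split and the class label, which equals $H(\rho_a) - \sum_i (\pi_{\Theta_a^i}/\pi_{\Theta_a}) H(\rho_a^i)$, so the cost and the impurity decrease sum to 1 for every query. The sketch below checks this on a hypothetical instance; it is a sanity check under these assumptions and not a substitute for the proof, and all names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mass(prior, node):
    return sum(prior[i] for i in node)

def impurity(prior, labels, node):
    """Entropy impurity of the class proportions at a node."""
    total = mass(prior, node)
    return entropy([mass(prior, [i for i in node if labels[i] == g]) / total
                    for g in set(labels[i] for i in node)])

def impurity_decrease(B, prior, labels, node, j):
    """Impurity at the node minus the weighted impurity of its two children."""
    total = mass(prior, node)
    children = ([i for i in node if B[i][j] == 0],
                [i for i in node if B[i][j] == 1])
    return impurity(prior, labels, node) - sum(
        mass(prior, c) / total * impurity(prior, labels, c)
        for c in children if c)

def split_cost(B, prior, labels, node, j):
    """C_a for query j; H(rho) equals the entropy of either child proportion."""
    total = mass(prior, node)
    p_one = mass(prior, [i for i in node if B[i][j] == 1]) / total
    cost = 1.0 - entropy([p_one, 1.0 - p_one])
    for g in set(labels[i] for i in node):
        members = [i for i in node if labels[i] == g]
        g_one = mass(prior, [i for i in members if B[i][j] == 1]) / mass(prior, members)
        cost += mass(prior, members) / total * entropy([g_one, 1.0 - g_one])
    return cost

# Hypothetical 5-object, 3-query, 3-class instance.
B = [[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
prior = [0.3, 0.2, 0.2, 0.2, 0.1]
labels = [1, 1, 2, 2, 3]
root = [0, 1, 2, 3, 4]
for j in range(3):
    total = split_cost(B, prior, labels, root, j) + impurity_decrease(B, prior, labels, root, j)
    print(j, total)  # each total equals 1 up to rounding
```

Because the two quantities always sum to a constant, the query minimizing $C_a$ is exactly the query maximizing the impurity decrease.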

Therefore, greedy optimization of (7) at internal nodes corresponds to greedy optimization of impurity.
Also, note that optimizing (7) at a leaf assigns the majority vote class label. Therefore, we conclude
that impurity-based decision tree induction with entropy as the impurity measure amounts to a greedy
optimization of the expected depth of a leaf node in the tree. Also, Theorem 3 allows us to interpret
impurity-based splitting algorithms for multiclass decision trees in terms of reduction factors, which also
appears to be a new insight.



4 Object identification under group queries 

In this section, we return to the problem of object identification. The input is a binary matrix $B$ denoting
the relationship between $M$ objects and $N$ queries, where the queries are grouped a priori into $n$ disjoint
categories, along with the a priori probability distribution $\Pi$ on the objects. However, unlike the decision
trees constructed in the previous two sections where the end user (e.g., a first responder) has to go
through a fixed set of questions as dictated by the decision tree, here, the user is offered more flexibility 
in choosing the questions at each stage. More specifically, the decision tree suggests a query group from 
the n groups instead of a single query at each stage, and the user can choose a query to answer from the 
suggested query group. 

A decision tree constructed with a group of queries at each stage has multiple branches at each internal 
node, corresponding to the size of the query group. Hence, a tree constructed in this fashion has multiple 
leaves ending in the same object. While traversing this decision tree, the user chooses the path at each 
internal node by selecting the query to answer from the given list of queries. Figure 4 demonstrates
a decision tree constructed in this fashion for the toy example shown in Figure 3. The circled nodes
correspond to the internal nodes, where each internal node is associated with a query group. The numbers
associated with a dashed edge correspond to the probability that the user will choose that path over the
others. The probability of reaching a node $a \in \mathcal{I}$ in the tree given $\theta \in \Theta_a$ is given by the product of
the probabilities on the dashed edges along the path from the root node to that node; for example, the
probability of reaching leaf node $\theta_1$ given $\theta = \theta_1$ in Figure 4 is 0.45. The problem now is to select the
query categories that will identify the object most efficiently, on average.

In addition to the terminology defined in Sections 1.2 and 2, we also define $z = (z_1, \dots, z_N)$ to be the
group labels of the queries, where $z_j \in \{1, \dots, n\}$, $\forall j = 1, \dots, N$. Let $\{Q^i\}_{i=1}^{n}$ be a partition of the query
set $Q$, where $Q^i$ denotes the set of queries in $Q$ that belong to group $i$. Similarly, at any node 'a' in a tree,
let $Q_a^i$ and $\bar{Q}_a^i$ denote the set of queries in $Q_a$ and $Q \setminus Q_a$ that belong to group $i$ respectively. Let $p_i(q)$
be the a priori probability of the user selecting query $q \in Q^i$ at any node with query group $i$ in the tree,
where $\sum_{q \in Q^i} p_i(q) = 1$. In addition, at any node 'a' in the tree, the function $p_i(q) = 0$, $\forall q \in Q_a^i$, since the
user would not choose a query which has already been answered, in which case $p_i(q)$ is renormalized. In
our experiments we take $p_i(q)$ to be uniform on $\bar{Q}_a^i$. Finally, let $z_a \in \{1, \dots, n\}$ denote the query group
selected at an internal node 'a' in the tree and let $p_a$ denote the probability of reaching that node given
$\theta \in \Theta_a$.

We denote a query learning problem for object identification with query groups by $(B, \Pi, z, p)$. Given
$(B, \Pi, z, p)$, let $\mathcal{T}(B, \Pi, z, p)$ denote the set of decision trees that can uniquely identify all the objects in
the set $\Theta$ with query groups at each internal node. For a decision tree $T \in \mathcal{T}(B, \Pi, z, p)$, let $\{\rho_a(q)\}_{q \in Q^{z_a}}$
denote the reduction factors of all the queries in the query group at each internal node $a \in \mathcal{I}$ in the tree,
where the reduction factors are treated as functions with input being a query.

Figure 3: Toy Example 2

Figure 4: Decision tree constructed on toy example 2 for object identification under group queries

Also, for a tree with $L$ leaves, let $\mathcal{L}^i \subset \mathcal{L} = \{1, \ldots, L\}$ denote the set of leaves terminating in object $\theta_i$
and let $d_j$ denote the depth of leaf node $j \in \mathcal{L}$. Then, the expected number of queries required to identify
the unknown object using the given tree is equal to

\[
E[K] = \sum_{i=1}^{M} \Pr(\theta = \theta_i)\, E[K \mid \theta = \theta_i] = \sum_{i=1}^{M} \pi_i \Big[ \sum_{j \in \mathcal{L}^i} p_j d_j \Big],
\]

where $p_j$ denotes the probability of reaching leaf node $j \in \mathcal{L}^i$ given $\theta = \theta_i$.



Theorem 5. The expected number of queries required to identify an object using a tree $T \in \mathcal{T}(B, \Pi, z, p)$
is given by

\[
E[K] = H(\Pi) + \sum_{a \in \mathcal{I}} p_a \pi_{\Theta_a} \Big[ 1 - \sum_{q \in Q^{z_a}} p_{z_a}(q)\, H(\rho_a(q)) \Big] \tag{8}
\]

Proof. Special case of Theorem 6 below. □



Note from the above theorem that, given a query learning problem $(B, \Pi, z, p)$, the expected number
of queries required to identify an object is lower bounded by its entropy $H(\Pi)$. Also, this lower bound can
be achieved iff the reduction factors of all the queries in a query group at each internal node of the tree are
equal to 0.5. In fact, Theorem 1 is a special case of the above theorem where each query group has just
one query.

Given a query learning problem $(B, \Pi, z, p)$, the problem of finding a decision tree with minimum $E[K]$
can be formulated as the following optimization problem

\[
\min_{T \in \mathcal{T}(B, \Pi, z, p)} \sum_{a \in \mathcal{I}} p_a \pi_{\Theta_a} \Big[ 1 - \sum_{q \in Q^{z_a}} p_{z_a}(q)\, H(\rho_a(q)) \Big] \tag{9}
\]



Note that here the reduction factors $\rho_a(q)$, $\forall q \in Q^{z_a}$, and the prior probability function $p_{z_a}(q)$ depend
on the query group $z_a \in \{1, \ldots, n\}$ chosen at node 'a' in the tree. The above optimization problem,
being a generalization of the single-query optimization problem associated with Theorem 1, is NP-complete. A greedy top-down local
optimization of the above objective function yields a suboptimal solution where we choose a query group
that minimizes the term $C_a(j) := 1 - \sum_{q \in Q^j} p_j(q)\, H(\rho_a(q))$ at each internal node, starting from the root
node. The algorithm, summarized in Algorithm 3 below, is referred to as GQSA (Group Queries Splitting
Algorithm) in the rest of this paper.

Group Queries Splitting Algorithm (GQSA)

Initialization: Let the leaf set consist of the root node
while some leaf node 'a' has $|\Theta_a| > 1$ do
    for each query group $j$ with $|Q_a^j| \geq 1$ do
        Compute the prior probabilities of selecting queries within the group, $p_j(q)$, $\forall q \in Q^j$, at node 'a'
        Compute the reduction factors for all the queries in the query group, $\{\rho_a(q)\}_{q \in Q^j}$
        Compute the cost $C_a(j)$ of using query group $j$ at node 'a'
    end
    Choose a query group $j$ with the least cost $C_a(j)$ at node 'a'
    Form the left and the right child nodes for all queries with $p_j(q) > 0$ in the query group
end

Algorithm 3: Greedy decision tree algorithm for object identification with group queries
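
The query-group selection step of GQSA can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the list-of-lists encoding of the matrix $B$, and the dictionary of query groups are hypothetical, and $p_j(q)$ is taken to be uniform over the queries of the group that are passed in (as in the experiments described above).

```python
import math

def binary_entropy(rho):
    """Binary entropy H(rho) in bits, with H(0) = H(1) = 0."""
    if rho <= 0.0 or rho >= 1.0:
        return 0.0
    return -rho * math.log2(rho) - (1.0 - rho) * math.log2(1.0 - rho)

def reduction_factor(B, prior, objects, q):
    """rho_a(q): mass of the heavier child divided by the mass at node 'a'."""
    mass = sum(prior[i] for i in objects)
    left = sum(prior[i] for i in objects if B[i][q] == 0)
    return max(left, mass - left) / mass

def gqsa_cost(B, prior, objects, group_queries):
    """C_a(j) = 1 - sum_q p_j(q) H(rho_a(q)), assuming uniform p_j(q)."""
    p = 1.0 / len(group_queries)  # assumption: uniform selection within the group
    return 1.0 - sum(
        p * binary_entropy(reduction_factor(B, prior, objects, q))
        for q in group_queries
    )

def choose_group(B, prior, objects, groups):
    """Greedy step: return the label of the query group with the least cost."""
    return min(groups, key=lambda j: gqsa_cost(B, prior, objects, groups[j]))

# Hypothetical toy data: 4 objects, 3 queries, two query groups.
B = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
prior = [0.25, 0.25, 0.25, 0.25]
groups = {1: [0, 1], 2: [2]}
best = choose_group(B, prior, [0, 1, 2, 3], groups)
```

In this toy instance both queries of group 1 split the objects evenly, so its cost is zero, whereas the single query of group 2 isolates one object and has a strictly positive cost.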



5 Group identification under group queries 

For the sake of completeness, we consider here the problem of identifying the group of an unknown object
$\theta \in \Theta$ under group queries. The input is a binary matrix $B$ denoting the relationship between $M$ objects
and $N$ queries, where the objects are grouped into $m$ groups and the queries are grouped into $n$ groups.
The task is to identify the group of an unknown object through as few queries from $Q$ as possible where,
at each stage, the user is offered a query group from which a query is chosen.

As noted in Section 3, a decision tree constructed for group identification can have multiple objects
terminating in the same leaf node. Also, a decision tree constructed for group identification with a query
group at each internal node has multiple leaves terminating in the same group. Hence a decision tree
constructed in this section can have multiple objects terminating in the same leaf node and multiple leaves
terminating in the same group. We reuse most of the terminology defined in Sections 3 and 4 here.

We denote a query learning problem for group identification with query groups by $(B, \Pi, y, z, p)$, where
$y = (y_1, \ldots, y_M)$ denotes the group labels on the objects, $z = (z_1, \ldots, z_N)$ denotes the group labels on the
queries and $p = (p_1(q), \ldots, p_n(q))$ denotes the a priori probability functions of selecting queries within
query groups. Given a query learning problem $(B, \Pi, y, z, p)$, let $\mathcal{T}(B, \Pi, y, z, p)$ denote the set of decision
trees that can uniquely identify the groups of all objects in the set $\Theta$ with query groups at each internal
node. For any decision tree $T \in \mathcal{T}(B, \Pi, y, z, p)$, let $\{\rho_a(q)\}_{q \in Q^{z_a}}$ denote the reduction factor set and let
$\{\{\rho_a^i(q)\}_{i=1}^{m}\}_{q \in Q^{z_a}}$ denote the group reduction factor sets at each internal node $a \in \mathcal{I}$ in the tree, where
$z_a \in \{1, \ldots, n\}$ denotes the query group selected at that node.

Also, for a tree with $L$ leaves, let $\mathcal{L}^i \subset \mathcal{L} = \{1, \ldots, L\}$ denote the set of leaves terminating in object
group $i$ and let $d_j$, $p_j$ denote the depth of leaf node $j \in \mathcal{L}$ and the probability of reaching that node given
$\theta \in \Theta_j$, respectively. Then, the expected number of queries required to identify the group of an unknown
object using the given tree is equal to

\[
E[K] = \sum_{i=1}^{m} \Pr(\theta \in \Theta^i)\, E[K \mid \theta \in \Theta^i] = \sum_{i=1}^{m} \pi_{\Theta^i} \Big[ \sum_{j \in \mathcal{L}^i} \frac{\pi_{\Theta_j}}{\pi_{\Theta^i}}\, p_j d_j \Big]
\]



Theorem 6. The expected number of queries required to identify the group of an unknown object using a
tree $T \in \mathcal{T}(B, \Pi, y, z, p)$ is given by

\[
E[K] = H(\Pi_y) + \sum_{a \in \mathcal{I}} p_a \pi_{\Theta_a} \Big[ 1 - \sum_{q \in Q^{z_a}} p_{z_a}(q) \Big[ H(\rho_a(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i(q)) \Big] \Big] \tag{10}
\]

where $\Pi_y$ denotes the probability distribution of the object groups induced by the labels $y$, i.e.
$\Pi_y = (\pi_{\Theta^1}, \ldots, \pi_{\Theta^m})$.

Proof. The proof is given in Appendix I. □

Note that Theorems 1, 2 and 5 are special cases of the above theorem. This theorem states that, given
a query learning problem $(B, \Pi, y, z, p)$, the expected number of queries required to identify the group of
an object is lower bounded by the entropy of the probability distribution of the object groups, $H(\Pi_y)$. It
also follows from the above theorem that this lower bound can be achieved iff the reduction factors and
the group reduction factors of all the queries in a query group at each internal node are equal to 0.5 and
1, respectively.

The problem of finding a decision tree with minimum $E[K]$ can be formulated as the following
optimization problem

\[
\min_{T \in \mathcal{T}(B, \Pi, y, z, p)} \sum_{a \in \mathcal{I}} p_a \pi_{\Theta_a} \Big\{ 1 - \sum_{q \in Q^{z_a}} p_{z_a}(q) \Big[ H(\rho_a(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i(q)) \Big] \Big\} \tag{11}
\]

Note that here the reduction factors $\{\rho_a(q)\}_{q \in Q^{z_a}}$, the group reduction factors $\{\rho_a^i(q)\}_{q \in Q^{z_a}}$ for all
$i = 1, \ldots, m$, and the prior probability function $p_{z_a}(q)$ depend on the query group $z_a \in \{1, \ldots, n\}$ chosen
at node 'a' in the tree. Once again, the above optimization problem, being a generalization of the
optimization problems considered earlier, is NP-complete. A greedy top-down optimization of the above objective
function yields a suboptimal solution where we choose a query group that minimizes the term

\[
C_a(j) := 1 - \sum_{q \in Q^j} p_j(q) \Big[ H(\rho_a(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i(q)) \Big]
\]

at each internal node, starting from the root node. The
algorithm, summarized in Algorithm 4 below, is referred to as GIGQSA (Group Identification under
Group Queries Splitting Algorithm).



Group Identification under Group Queries Splitting Algorithm (GIGQSA)

Initialization: Let the leaf set consist of the root node
while some leaf node 'a' has more than one group of objects do
    for each query group $j$ with $|Q_a^j| \geq 1$ do
        Compute the prior probabilities of selecting queries within the group, $p_j(q)$, $\forall q \in Q^j$, at node 'a'
        Compute the reduction factors for all the queries in the query group, $\{\rho_a(q)\}_{q \in Q^j}$
        Compute the group reduction factors for all the queries in the query group, $\{\rho_a^i(q)\}_{q \in Q^j}$, $\forall i = 1, \ldots, m$
        Compute the cost $C_a(j)$ of using query group $j$ at node 'a'
    end
    Choose a query group $j$ with the least cost $C_a(j)$ at node 'a'
    Form the left and the right child nodes for all queries with $p_j(q) > 0$ in the query group
end

Algorithm 4: Greedy decision tree algorithm for group identification under group queries
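
The cost used by GIGQSA differs from the GQSA cost only through the group reduction factors. The sketch below illustrates how $C_a(j)$ might be evaluated at a single node. It is an assumption-laden illustration rather than the authors' code: the function names and data layout are hypothetical, and $p_j(q)$ is assumed uniform over the queries supplied for the group.

```python
import math

def binary_entropy(rho):
    """Binary entropy H(rho) in bits, with H(0) = H(1) = 0."""
    if rho <= 0.0 or rho >= 1.0:
        return 0.0
    return -rho * math.log2(rho) - (1.0 - rho) * math.log2(1.0 - rho)

def reduction_factor(B, prior, objects, q):
    """Mass of the heavier side of the split of `objects` by query q, normalized."""
    mass = sum(prior[i] for i in objects)
    left = sum(prior[i] for i in objects if B[i][q] == 0)
    return max(left, mass - left) / mass

def gigqsa_cost(B, prior, labels, objects, group_queries):
    """C_a(j) = 1 - sum_q p_j(q) [H(rho_a(q)) - sum_i (pi_i/pi) H(rho_a^i(q))]."""
    mass = sum(prior[i] for i in objects)
    p = 1.0 / len(group_queries)  # assumption: uniform selection within the group
    total = 0.0
    for q in group_queries:
        term = binary_entropy(reduction_factor(B, prior, objects, q))
        for g in set(labels[i] for i in objects):
            members = [i for i in objects if labels[i] == g]
            group_mass = sum(prior[i] for i in members)
            rho_g = reduction_factor(B, prior, members, q)  # group reduction factor
            term -= (group_mass / mass) * binary_entropy(rho_g)
        total += p * term
    return 1.0 - total

# Hypothetical toy data: 4 objects in 2 groups, 2 queries.
B = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [0, 0, 1, 1]
prior = [0.25] * 4
cost_q0 = gigqsa_cost(B, prior, labels, [0, 1, 2, 3], [0])
cost_q1 = gigqsa_cost(B, prior, labels, [0, 1, 2, 3], [1])
```

Here query 0 separates the two groups cleanly (reduction factor 0.5, group reduction factors 1) and attains cost 0, while query 1 splits every group in half and attains cost 1, which matches the condition stated after Theorem 6.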



6 Query learning with persistent noise 

We now consider the problem of identifying an unknown object $\theta \in \Theta$ through as few queries as possible in
the presence of persistent query noise, and relate this problem to group identification. Query noise refers
to errors in the query responses, i.e., the observed query response is different from the true response of the
unknown object. For example, a victim of toxic chemical exposure may not report a symptom because of
a delayed onset of that symptom. Unlike the noise model often assumed in the literature, where repeated
querying results in independent realizations of the noise, persistent query noise is a more stringent noise
model where repeated queries result in the same response.

Figure 5: For the toy example shown in (a) consisting of 2 objects and 3 queries with $\epsilon = 1$, where queries $q_2$
and $q_3$ are prone to persistent noise, (b) demonstrates the construction of matrix $\tilde{B}$

We refer to the bit string consisting of observed query responses as an input string. The input string 
can differ from the true bit string (corresponding to the row vector of the true object in matrix B) due 
to persistent query noise. First, we describe the error model and then describe the application of group 
identification algorithms to uniquely identify the true object in the presence of persistent errors. 

Consider the case where a fraction $\nu$ of the $N$ queries are prone to error. Also, assume that at any
instance, not more than $\epsilon$ of these $\nu N$ queries are in error, where $\epsilon := \lfloor (\delta_{\min} - 1)/2 \rfloor$, $\delta_{\min}$ being the minimum
Hamming distance between any two rows of the matrix $B$. The a priori probability distribution of the
number of errors is considered to be one of the following,

\[
\text{Probability model 1:} \quad \Pr(e \text{ errors}) = \frac{\binom{N\nu}{e}}{\sum_{e'=0}^{\epsilon'} \binom{N\nu}{e'}}, \quad 0 \le e \le \epsilon'
\]

\[
\text{Probability model 2:} \quad \Pr(e \text{ errors}) = \frac{\binom{N\nu}{e} p^{e} (1-p)^{N\nu - e}}{\sum_{e'=0}^{\epsilon'} \binom{N\nu}{e'} p^{e'} (1-p)^{N\nu - e'}}, \quad 0 \le e \le \epsilon'
\]

where $\epsilon' := \min(\epsilon, \nu N)$. Note that probability model 2 corresponds to a truncated binomial distribution
where $0 \le p \le 0.5$ denotes the probability that a query prone to error is actually in error, while probability
model 1 is a special case of probability model 2 when $p = 0.5$. Given this error model, the goal is to identify
the true object through as few queries from $Q$ as possible.
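
As a quick numerical illustration of the two error models, the following Python sketch evaluates the truncated binomial distribution over the number of errors. The function name and arguments are hypothetical, and the sketch assumes that $\nu N$ is an integer (passed as `n_prone`). Probability model 1 is obtained by leaving `p` at its default of 0.5.

```python
from math import comb

def error_count_distribution(n_prone, eps, p=0.5):
    """Pr(e errors) for e = 0..eps', with eps' = min(eps, n_prone).

    n_prone plays the role of nu*N (assumed to be an integer here);
    p = 0.5 gives probability model 1, any other p gives model 2.
    """
    eps_prime = min(eps, n_prone)
    weights = [
        comb(n_prone, e) * p**e * (1.0 - p) ** (n_prone - e)
        for e in range(eps_prime + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Two error-prone queries, at most one error.
model1 = error_count_distribution(2, 1)          # [1/3, 2/3]
model2 = error_count_distribution(2, 1, p=0.25)  # [0.6, 0.4]
```

With two error-prone queries and at most one error, model 1 weights the outcomes by the number of bit strings only, whereas model 2 with $p = 0.25$ shifts mass toward the error-free outcome.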

This problem can be posed as a group identification problem as follows: Given a query learning problem
$(B, \Pi)$ with $M$ objects and $N$ queries that is susceptible to $\epsilon$ errors, with a fraction $\nu$ of the $N$ queries
prone to error, create $(\tilde{B}, \tilde{\Pi})$ with $M$ groups of objects and $N$ queries, where each object group in this new
matrix consists of $\sum_{e=0}^{\epsilon'} \binom{N\nu}{e}$ objects corresponding to all possible bit strings that differ from the original
bit string in at most $\epsilon'$ positions corresponding to the $\nu N$ bits prone to error. Consider the toy example
shown in Figure 5(a), consisting of 2 objects and 3 queries with $\epsilon = 1$, where queries $q_2$ and $q_3$ are prone
to persistent query noise. Figure 5(b) demonstrates the construction of $\tilde{B}$ for this toy example.

Each bit string in the object set $\tilde{\Theta}^i$ corresponds to one of the possible input strings when the true
object is $\theta_i$ and at most $\epsilon'$ errors occur. Also, by definition of $\epsilon$, no two bit strings in the matrix $\tilde{B}$ are
the same. Given the a priori probabilities of the objects in $B$, the prior distribution of objects in $\tilde{B}$ is
generated as follows. For an object belonging to group $i$ in $\tilde{B}$ whose bit string differs in $e \le \epsilon'$ bit positions
from the true bit string of $\theta_i$, the prior probability is given by

\[
\text{Probability model 1:} \quad \frac{1}{\sum_{e'=0}^{\epsilon'} \binom{N\nu}{e'}}\, \pi_i
\]

\[
\text{Probability model 2:} \quad \frac{p^{e} (1-p)^{N\nu - e}}{\sum_{e'=0}^{\epsilon'} \binom{N\nu}{e'} p^{e'} (1-p)^{N\nu - e'}}\, \pi_i
\]

Figure 5(b) shows the prior probability distribution of the objects in $\tilde{B}$ using probability model 1 ($\tilde{\Pi}_1$) and
probability model 2 with $p = 0.25$ ($\tilde{\Pi}_2$) for the toy example shown in Figure 5(a).
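
The construction of the expanded problem can be made concrete with a short Python sketch. The matrix used below is a hypothetical 2-object, 3-query example chosen so that the minimum Hamming distance is 3 (hence $\epsilon = 1$); it is not claimed to be the matrix of Figure 5. The function name and return format are likewise assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def expand_with_errors(B, prior, prone, eps, p=0.5):
    """Build the rows, group labels and priors of the expanded problem.

    prone: indices of the error-prone queries; eps: maximum number of errors.
    p = 0.5 reproduces probability model 1, other values give model 2.
    """
    n = len(prone)
    eps_prime = min(eps, n)
    norm = sum(comb(n, e) * p**e * (1.0 - p) ** (n - e) for e in range(eps_prime + 1))
    rows, labels, priors = [], [], []
    for i, row in enumerate(B):
        for e in range(eps_prime + 1):
            for flipped in combinations(prone, e):
                noisy = list(row)
                for q in flipped:
                    noisy[q] = 1 - noisy[q]  # flip the response of an erroneous query
                rows.append(noisy)
                labels.append(i)  # every noisy copy belongs to the group of object i
                priors.append(prior[i] * p**e * (1.0 - p) ** (n - e) / norm)
    return rows, labels, priors

# Hypothetical toy problem: rows at Hamming distance 3, queries 1 and 2 error-prone.
B = [[0, 0, 0], [1, 1, 1]]
rows, labels, priors = expand_with_errors(B, [0.5, 0.5], prone=[1, 2], eps=1, p=0.25)
```

Each original object yields $1 + 2 = 3$ rows in this example, and the priors within each group sum to the prior of the original object.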

Given a query learning problem $(B, \Pi)$ that is susceptible to $\epsilon$ errors, the problem of identifying an
unknown object in the presence of at most $\epsilon$ persistent errors can be reduced to the problem of identifying the
group of an unknown object in $(\tilde{B}, \tilde{\Pi})$, where $(\tilde{B}, \tilde{\Pi})$ is generated as described above. One possible concern
with this approach is the memory required to generate the matrix $\tilde{B}$, due to the combinatorial
explosion in the number of objects in $\tilde{B}$. Interestingly, the relevant quantities for query selection in both
GISA and GBS (i.e., the reduction factors) can be efficiently computed without explicitly constructing the
$\tilde{B}$ matrix, as described in detail in Appendix II.



7 Experiments 

We perform three sets of experiments, demonstrating our algorithms for group identification, object
identification using query groups, and query learning with persistent noise. In each case, we compare the
performance of the proposed algorithms to standard algorithms such as the splitting algorithm, using
synthetic data as well as a real dataset, the WISER database. The WISER database is a toxic chemical
database describing the binary relationship between 298 toxic chemicals and 79 acute symptoms. The
symptoms are grouped into 10 categories (e.g., neurological, cardio) as determined by NLM, and the
chemicals are grouped into 16 categories (e.g., pesticides, corrosive acids) as determined by a toxicologist and a
Hazmat expert.

7.1 Group identification 

Here, we consider a query learning problem $(B, \Pi)$ where the objects are grouped into $m$ groups given by
$y = (y_1, \ldots, y_M)$, $y_i \in \{1, \ldots, m\}$, with the task of identifying the group of an unknown object from the
object set $\Theta$ through as few queries from $Q$ as possible. First, we consider random datasets generated
using a random data model and compare the performance of GBS and GISA for group identification in
these random datasets. Then, we compare the performance of the two algorithms in the WISER database.
In both these experiments, we assume a uniform a priori probability distribution on the objects.

7.1.1 Random Datasets 

We consider random datasets of the same size as the WISER database, with 298 objects and 79 queries
where the objects are grouped into 16 classes with the same group sizes as that in the WISER database. We
associate each query in a random dataset with two parameters, $\gamma_w \in [0.5, 1]$, which reflects the correlation of
the object responses within a group, and $\gamma_b \in [0.5, 1]$, which captures the correlation of the object responses
between groups. When $\gamma_w$ is close to 0.5, each object within a group is equally likely to exhibit 0 or 1 as
its response to the query, whereas, when $\gamma_w$ is close to 1, most of the objects within a group are highly
likely to exhibit the same response to the query. Similarly, when $\gamma_b$ is close to 0.5, each group is equally
likely to exhibit 0 or 1 as its response to the query, where a group response corresponds to the majority
vote of the object responses within a group, while, as $\gamma_b$ tends to 1, most of the groups are highly likely to
exhibit the same response.

Given a $(\gamma_w, \gamma_b)$ pair for a query in a random dataset, the object responses for that query are created
as follows

1. Generate a Bernoulli random variable, $x$

2. For each group $i \in \{1, \ldots, m\}$, assign a binary label $b_i$, where $b_i = x$ with probability $\gamma_b$

3. For each object in group $i$, assign $b_i$ as the object response with probability $\gamma_w$

Given the correlation parameters $(\gamma_w(q), \gamma_b(q)) \in [0.5, 1]^2$, $\forall q \in Q$, a random dataset can be created by
following the above procedure for each query. Conversely, Section 7.1.2 describes how to estimate
these parameters for a given dataset.

Figure 6: Expected number of queries required to identify the group of an object using GBS and GISA on random
datasets generated using the proposed random data model
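
The three-step random data model can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name is hypothetical, and since the text only specifies "a Bernoulli random variable" for $x$, the sketch assumes a fair coin.

```python
import random

def random_query_column(group_sizes, gamma_w, gamma_b, rng):
    """Generate the responses of all objects to one query.

    group_sizes: number of objects in each group, in order.
    Returns a flat list of 0/1 responses, grouped in the same order.
    """
    x = rng.randint(0, 1)  # step 1 (assumption: Bernoulli with parameter 0.5)
    column = []
    for size in group_sizes:
        # step 2: the group label agrees with x with probability gamma_b
        b = x if rng.random() < gamma_b else 1 - x
        for _ in range(size):
            # step 3: each object agrees with its group label with probability gamma_w
            column.append(b if rng.random() < gamma_w else 1 - b)
    return column

rng = random.Random(0)
# One query for three groups of sizes 3, 2 and 4.
column = random_query_column([3, 2, 4], gamma_w=0.9, gamma_b=0.6, rng=rng)
```

Repeating the call once per query, with that query's own $(\gamma_w, \gamma_b)$ pair, yields the columns of a random matrix $B$.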

Figure 6 compares the mean $E[K]$ for GBS and GISA in 100 randomly generated datasets (for each
value of $d_1$ and $d_2$), where the random datasets are created such that the query parameters are uniformly
distributed in the rectangular space governed by $d_1$, $d_2$ as shown in Figure 7. This demonstrates the
improved performance of GISA over GBS in group identification. In particular, note that $E[K]$ approaches
the entropy $H(\Pi_y)$ using GISA as $d_2$ increases.

This is due to the increase in the number of queries in the fourth quadrant of the parameter space
as $d_2$ increases. Specifically, as the correlation parameters $\gamma_w$, $\gamma_b$ tend to 1 and 0.5 respectively, choosing
that query eliminates approximately half the groups, with each group being either completely eliminated or
completely included, i.e. the group reduction factors tend to 1 for these queries. Such queries are preferable
in group identification, and GISA is specifically designed to search for these queries, leading to its strikingly
improved performance over GBS as $d_2$ increases.

7.1.2 WISER Database 



Table 1 compares the expected number of queries required to identify the group of an unknown object
in the WISER database using GISA, GBS and random search, where the group entropy in the WISER
database is given by $H(\Pi_y) = 3.068$. The table reports the 95% symmetric confidence intervals based on
random trials, where the randomness in GISA and GBS is due to the presence of multiple best splits at
each internal node.

However, the improvement of GISA over GBS on WISER is less than was observed for many of the
random datasets discussed above. To understand this, we developed a method to estimate the correlation
parameters of the queries for a given dataset $B$. For each query in the dataset, the correlation parameters
can be estimated as follows

1. For every group $i \in \{1, \ldots, m\}$, let $b_i$ denote the group response given by the majority vote of object
responses in the group and let $\gamma_w^i$ denote the fraction of objects in the group with similar response
as $b_i$

2. Denote by a binary variable $x$, the majority vote of the group responses $b = [b_1, \ldots, b_m]$

3. Then, $\gamma_b$ is given by the fraction of groups with similar response as $x$, and $\gamma_w = \ldots$

Now, we use the above procedure to estimate the query parameters for all queries in the WISER database,
shown in Figure 8. Note from this figure that there is just one query in the fourth quadrant of the parameter
space and there are no queries with $\gamma_w$ close to 1 and $\gamma_b$ close to 0.5. In words, chemicals in the same
group tend to behave differently and chemicals in different groups tend to exhibit similar response to the
symptoms. This is a manifestation of the non-specificity of the symptoms in the WISER database as
reported by Bhavnani et al. [28].

Figure 7: Random data model - The query parameters $(\gamma_w(q), \gamma_b(q))$ are restricted to lie in the rectangular space

Figure 8: Scatter plot of the query parameters in the WISER database
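
The estimation procedure for a single query can be sketched as below. The sketch returns $\gamma_b$ together with the per-group values $\gamma_w^i$; how the $\gamma_w^i$ are combined into a single $\gamma_w$ is not reproduced here, so that step is deliberately left out. The function name is hypothetical, and breaking majority-vote ties in favour of 1 is an assumption of this sketch.

```python
def estimate_parameters(column, labels):
    """Estimate gamma_b and the per-group gamma_w^i for one query.

    column: 0/1 responses of all objects to the query.
    labels: group label of each object.
    """
    groups = sorted(set(labels))
    group_response = {}
    gamma_w_per_group = {}
    for g in groups:
        responses = [column[i] for i, y in enumerate(labels) if y == g]
        ones = sum(responses)
        # majority vote within the group (assumption: ties are broken toward 1)
        group_response[g] = 1 if 2 * ones >= len(responses) else 0
        agree = ones if group_response[g] == 1 else len(responses) - ones
        gamma_w_per_group[g] = agree / len(responses)
    votes = sum(group_response.values())
    x = 1 if 2 * votes >= len(groups) else 0  # majority vote over the groups
    gamma_b = sum(1 for g in groups if group_response[g] == x) / len(groups)
    return gamma_b, gamma_w_per_group

# Three groups of sizes 3, 3 and 4.
column = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
gamma_b, gamma_w_i = estimate_parameters(column, labels)
```

In this toy column two of the three groups respond with 1, so $\gamma_b = 2/3$, and only the first group is internally inconsistent, with $\gamma_w^0 = 2/3$.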



Algorithm        E[K]
GISA             7.792 ± 0.001
GBS              7.948 ± 0.003
Random Search    16.328 ± 0.177

Table 1: Expected number of queries required to identify the group of an object in WISER database

Algorithm                                      E[K]
GBS                                            8.283 ± 0.000
GQSA                                           11.360 ± 0.096
$\min_i \min_{q \in Q^i} p_i(q) \rho_a(q)$     13.401 ± 0.116
$\min_i \max_{q \in Q^i} p_i(q) \rho_a(q)$     18.697 ± 0.357
Random Search                                  20.251 ± 0.318

Table 2: Expected number of queries required to identify an object under group queries in WISER database


7.2 Object identification under query classes 

In this section, we consider a query learning problem $(B, \Pi)$ where the queries are a priori grouped into
$n$ groups given by $z = (z_1, \ldots, z_N)$, $z_j \in \{1, \ldots, n\}$, with the task of identifying an unknown object from
the set $\Theta$ through as few queries from $Q$ as possible, where the user is presented with a query group at
each stage to choose from. Note that this approach is midway between a complete active search strategy
and a complete passive search strategy. Hence, we primarily compare the performance of GQSA to a
completely active search strategy such as GBS and a completely passive search strategy like random search,
where the user randomly chooses the queries from the set $Q$ to answer. In addition, we also compare
GQSA to other possible heuristics where we choose a query group $i$ that minimizes $\min_{q \in Q^i} p_i(q) \rho_a(q)$ or
$\max_{q \in Q^i} p_i(q) \rho_a(q)$ at each internal node 'a'.

First, we compare the performance of these algorithms on random datasets generated using a random
data model. Then, we compare them in the WISER database. In both these experiments, we assume a
uniform a priori probability distribution on the objects as well as on queries within a group. The latter
probability distribution corresponds to the probability of a user selecting a particular query $q$ from a query
group, $p_i(q)$, $\forall i = 1, \ldots, n$.

Figure 9: Expected number of queries required by different algorithms for object identification under group queries
in random datasets

Figure 10: Comparison between the performance of GBS and GISA in identifying the true object in the presence of
restricted persistent noise under probability model 1



7.2.1 Random Datasets 

Here, we consider random datasets of the same size as the WISER database, with 298 objects and 79
queries where the queries are grouped into 10 groups with the same group sizes as that in the WISER
database. We associate a random dataset with a parameter $\gamma_{\max} \in [0.5, 1]$, where $\gamma_{\max}$ corresponds to the
maximum permissible value of $\gamma_b$ for a query in the random dataset. Given a $\gamma_{\max}$, a random dataset is
created as follows

1. For each query group, generate a $\gamma_b \in [0.5, \gamma_{\max}]$

2. For each query in the query group, generate a Bernoulli random variable $x$ and give each object the
same query label as $x$ with probability $\gamma_b$

Figure 9 compares the mean $E[K]$ for the respective algorithms in 100 randomly generated datasets,
for each value of $\gamma_{\max}$. Here, minmin corresponds to the heuristic where we minimize $\min_{q \in Q^i} p_i(q) \rho_a(q)$ at
each internal node and minmax corresponds to the heuristic where we minimize $\max_{q \in Q^i} p_i(q) \rho_a(q)$.
Note from the figure that in spite of not being a completely active search strategy, the performance of
GQSA is comparable to that of GBS and better than the other algorithms.



7.2.2 WISER Database 



Table 2 compares the expected number of queries required to identify an unknown object under group
queries in the WISER database using the respective algorithms, where the entropy of the objects in the
WISER database is given by $H(\Pi) = 8.219$. The table reports the 95% symmetric confidence intervals
based on random trials, where the randomness in GBS is due to the presence of multiple best splits at each
internal node.

Figure 11: Comparison between the performance of GBS and GISA in identifying the true object in the presence of
restricted persistent noise under probability model 2

Figure 12: Comparison between the performance of GBS and GISA in probability model 2 in the presence of
discrepancies between the true value of $p$, $p_{true}$, and the value used in the algorithm, $p_{alg}$



Once again, it is not surprising that GBS outperforms GQSA, as GBS is fully active, i.e., it always
chooses the best split, whereas GQSA does not always pick the best split, since a human is involved. Yet,
the performance of GQSA is not much worse than that of GBS. In fact, if we were to fully model the
time-delay associated with answering a query, then GQSA might have a smaller "time to identification,"
because presumably it would take less time to answer the queries on average.

7.3 Query learning with persistent noise 

In Section 6, we showed that identifying an unknown object in the presence of persistent errors can be
reduced to a group identification problem. Hence, any group identification algorithm can be adopted to
solve this problem. Here, we compare the performance of GBS and GISA in identifying the unknown
object in the presence of persistent errors.

Note from Section 6 that the generation of matrix $\tilde{B}$ requires the knowledge of the queries from the
set $Q$ that are prone to error. We assume this knowledge in all our experiments in this section. Below, we
show the procedure adopted to simulate the error model,

1. Select the fraction $\nu$ of the $N$ queries that are prone to error

2. Generate $e \in \{0, \ldots, \epsilon'\}$ according to the selected probability model

3. Choose $e$ queries from the above set of $N\nu$ queries

4. Flip the object responses of these $e$ queries in the true object
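
The four simulation steps above can be sketched in Python as follows. The function names are hypothetical, the set of error-prone queries is assumed to be drawn uniformly at random with $\nu N$ rounded to an integer, and the number of errors is drawn from the truncated binomial of probability model 2 (with `p = 0.5` giving probability model 1).

```python
import random
from math import comb

def select_prone_queries(n_queries, nu, rng):
    """Step 1: pick the error-prone queries (assumption: uniformly at random)."""
    return rng.sample(range(n_queries), round(nu * n_queries))

def simulate_noisy_response(true_row, prone, eps, p, rng):
    """Steps 2-4: draw the number of errors, choose the queries, flip their bits."""
    n = len(prone)
    eps_prime = min(eps, n)
    # Unnormalized truncated-binomial weights; random.choices normalizes them.
    weights = [comb(n, e) * p**e * (1.0 - p) ** (n - e) for e in range(eps_prime + 1)]
    e = rng.choices(range(eps_prime + 1), weights=weights)[0]
    flipped = rng.sample(prone, e)
    noisy = list(true_row)
    for q in flipped:
        noisy[q] = 1 - noisy[q]
    return noisy, flipped

rng = random.Random(0)
true_row = [0, 1, 1, 0, 1, 0, 0, 1]
prone = select_prone_queries(len(true_row), nu=0.5, rng=rng)
noisy_row, flipped = simulate_noisy_response(true_row, prone, eps=2, p=0.25, rng=rng)
```

The returned row plays the role of the input string: it differs from the true bit string in at most $\epsilon'$ positions, all of them among the error-prone queries.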

We compare the performance of GBS and GISA in a subset of the WISER database consisting of 131
toxic chemicals and 79 symptom queries with $\epsilon = 2$. Figure 10 shows the expected number of queries
required by GBS and GISA to identify the true object in the presence of a maximum of $\epsilon$ persistent errors
for different values of $\nu$, using probability model 1. Figure 11 shows the same for different values of $p$
using probability model 2. Note that except for the extreme cases where $\nu = 0$ and $\nu = 1$, GISA improves
considerably over GBS. When $\nu = 0$ or $\nu = 1$, GBS and GISA reduce to the same algorithm.

Also, note that in probability model 2, the algorithms require the knowledge of $p$, as $\tilde{\Pi}$ depends on $p$.
Though this probability can be estimated with the help of external knowledge sources beyond the database,
such as domain experts, user surveys or by analyzing past query logs, the estimated value of $p$ can vary
slightly from its true value. Hence, we tested the sensitivity of these two algorithms to error in the value
of $p$ and noted that their performance changes little under discrepancies in the value of $p$, as
shown in Figure 12.



8 Conclusions and Future work 

In this paper, we developed algorithms that broaden existing methods for query learning to incorporate 
factors that are specific to a given task and environment. These algorithms are greedy algorithms derived 
in a common, principled framework based on a generalization of Shannon-Fano coding to group-based 
query learning. While our running example has been toxic chemical identification, the methods presented 
are applicable to a much broader class of applications, such as other forms of emergency response, fault 
diagnosis, network failure diagnosis or Internet based data search. 

In a series of experiments on synthetic data and a toxic chemical database, we demonstrated the
effectiveness of our algorithms relative to the standard splitting algorithm, also known as generalized
binary search (GBS), which is the most commonly studied algorithm for query learning. In some settings,
our algorithms outperform GBS by a wide margin. Furthermore, in the case of group identification, we
have described a simple visualization (see Figure 8), based on the underlying data matrix, that explains
how much can be gained from GISA, our group identification algorithm. That is, it offers a picture of how
much GISA will improve upon GBS without running either algorithm.

While this work is a step towards making query learning algorithms better suited to real-world
identification tasks, there are many other issues that deserve to be examined in future work. These include
challenges such as multiple objects present, probabilities of query response or query noise, or user
confidence. In query learning with persistent noise, our approach can only recover from a restricted number of
query errors, depending on the minimum Hamming distance between objects. While this assumption is
required if we desire unique identification, it would be interesting to loosen this assumption by pursuing a
slightly less ambitious goal. Additionally, instead of minimizing the expected number of queries required
for object/group identification, it would be valuable to develop a similar framework that minimizes the
number of queries in the worst case, thereby eliminating dependence on the prior probabilities. Finally,
it seems plausible that performance results like those proved in [2] might also be possible for group-based
query learning.



Appendix I - Proofs 
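8.1 Proof of Theorem 2

Let $T_a$ denote a subtree from any node 'a' in the tree $T$ and let $\mathcal{L}_a$ denote the set of leaf nodes in this
subtree. Then, let $\mu_a$ denote the expected depth of the leaf nodes in this subtree, given by

\[
\mu_a = \sum_{j \in \mathcal{L}_a} \frac{\pi_{\Theta_j}}{\pi_{\Theta_a}}\, d_j^a,
\]

where $d_j^a$ corresponds to the depth of leaf node $j$ in the subtree $T_a$, and let $H_a$ denote the entropy of the
probability distribution of the classes at the root node of the subtree $T_a$, i.e.

\[
H_a = -\sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} \log \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}}.
\]

Now, we show using induction that for any subtree $T_a$ in the tree $T$, the following relation holds

\[
\pi_{\Theta_a} [\mu_a - H_a] = \sum_{s \in \mathcal{I}_a} \pi_{\Theta_s} \Big[ 1 - H(\rho_s) + \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i) \Big] - \sum_{s \in \mathcal{L}_a} \pi_{\Theta_s} I(\Theta_s),
\]

where $\mathcal{I}_a$, $\mathcal{L}_a$ denote the set of internal nodes and the set of leaf nodes in the subtree $T_a$, respectively.

The relation holds trivially for any subtree rooted at a leaf node of the tree $T$, with both the left hand
side and the right hand side of the expression equal to $-\pi_{\Theta_a} I(\Theta_a)$ (note that $I(\Theta_a) = H_a$). Now,
assume the above relation holds for the subtrees rooted at the left and right child nodes of node 'a'. Then,
using Lemma 1 we have

\[
\begin{aligned}
\pi_{\Theta_a} [\mu_a - H_a]
&= \pi_{\Theta_{l(a)}} [\mu_{l(a)} - H_{l(a)}] + \pi_{\Theta_{r(a)}} [\mu_{r(a)} - H_{r(a)}] + \pi_{\Theta_a} \Big[ 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i) \Big] \\
&= \sum_{s \in \mathcal{I}_{l(a)}} \pi_{\Theta_s} \Big[ 1 - H(\rho_s) + \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i) \Big] - \sum_{s \in \mathcal{L}_{l(a)}} \pi_{\Theta_s} I(\Theta_s) \\
&\quad + \sum_{s \in \mathcal{I}_{r(a)}} \pi_{\Theta_s} \Big[ 1 - H(\rho_s) + \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i) \Big] - \sum_{s \in \mathcal{L}_{r(a)}} \pi_{\Theta_s} I(\Theta_s) \\
&\quad + \pi_{\Theta_a} \Big[ 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i) \Big] \\
&= \sum_{s \in \mathcal{I}_a} \pi_{\Theta_s} \Big[ 1 - H(\rho_s) + \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i) \Big] - \sum_{s \in \mathcal{L}_a} \pi_{\Theta_s} I(\Theta_s),
\end{aligned}
\]

thereby completing the induction. Finally, the result follows by applying the relation to the tree $T$, whose
probability mass at the root node is $\pi_{\Theta_a} = 1$ and whose leaves each contain objects of a single group, so that
$I(\Theta_s) = 0$ at every leaf.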



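Lemma 1.

\[
\pi_{\Theta_a} [\mu_a - H_a] = \pi_{\Theta_{l(a)}} [\mu_{l(a)} - H_{l(a)}] + \pi_{\Theta_{r(a)}} [\mu_{r(a)} - H_{r(a)}] + \pi_{\Theta_a} \Big[ 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i) \Big]
\]

Proof. We first note that $\pi_{\Theta_a} \mu_a$ for a subtree $T_a$ can be decomposed as

\[
\begin{aligned}
\pi_{\Theta_a} \mu_a &= \sum_{j \in \mathcal{L}_a} \pi_{\Theta_j} d_j^a \\
&= \sum_{j \in \mathcal{L}_{l(a)}} \pi_{\Theta_j} d_j^a + \sum_{j \in \mathcal{L}_{r(a)}} \pi_{\Theta_j} d_j^a \\
&= \sum_{j \in \mathcal{L}_{l(a)}} \pi_{\Theta_j} (d_j^a - 1) + \sum_{j \in \mathcal{L}_{r(a)}} \pi_{\Theta_j} (d_j^a - 1) + \sum_{j \in \mathcal{L}_a} \pi_{\Theta_j} \\
&= \pi_{\Theta_{l(a)}} \mu_{l(a)} + \pi_{\Theta_{r(a)}} \mu_{r(a)} + \pi_{\Theta_a}.
\end{aligned}
\tag{12}
\]

Similarly, $\pi_{\Theta_a} H_a$ can be decomposed as

\[
\begin{aligned}
\pi_{\Theta_a} H_a &= \sum_{i=1}^{m} \pi_{\Theta_a^i} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_a^i}} \\
&= \sum_{i=1}^{m} \pi_{\Theta_{l(a)}^i} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_a^i}} + \sum_{i=1}^{m} \pi_{\Theta_{r(a)}^i} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_a^i}} \\
&= \sum_{i=1}^{m} \pi_{\Theta_{l(a)}^i} \log \frac{\pi_{\Theta_{l(a)}}}{\pi_{\Theta_{l(a)}^i}} + \sum_{i=1}^{m} \pi_{\Theta_{l(a)}^i} \log \frac{\pi_{\Theta_{l(a)}^i}}{\pi_{\Theta_a^i}} + \sum_{i=1}^{m} \pi_{\Theta_{l(a)}^i} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_{l(a)}}} \\
&\quad + \sum_{i=1}^{m} \pi_{\Theta_{r(a)}^i} \log \frac{\pi_{\Theta_{r(a)}}}{\pi_{\Theta_{r(a)}^i}} + \sum_{i=1}^{m} \pi_{\Theta_{r(a)}^i} \log \frac{\pi_{\Theta_{r(a)}^i}}{\pi_{\Theta_a^i}} + \sum_{i=1}^{m} \pi_{\Theta_{r(a)}^i} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_{r(a)}}} \\
&= \pi_{\Theta_{l(a)}} H_{l(a)} + \pi_{\Theta_{r(a)}} H_{r(a)} - \sum_{i=1}^{m} \Big[ \pi_{\Theta_{l(a)}^i} \log \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_{l(a)}^i}} + \pi_{\Theta_{r(a)}^i} \log \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_{r(a)}^i}} \Big] \\
&\quad + \Big[ \pi_{\Theta_{l(a)}} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_{l(a)}}} + \pi_{\Theta_{r(a)}} \log \frac{\pi_{\Theta_a}}{\pi_{\Theta_{r(a)}}} \Big] \\
&= \pi_{\Theta_{l(a)}} H_{l(a)} + \pi_{\Theta_{r(a)}} H_{r(a)} - \sum_{i=1}^{m} \pi_{\Theta_a^i} H(\rho_a^i) + \pi_{\Theta_a} H(\rho_a).
\end{aligned}
\tag{13}
\]

The result follows from (12) and (13) above. □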



8.2 Proof of Theorem H 

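From relation (13) in Lemma 1, we have

\[
H_a - \frac{\pi_{\Theta_{l(a)}}}{\pi_{\Theta_a}} H_{l(a)} - \frac{\pi_{\Theta_{r(a)}}}{\pi_{\Theta_a}} H_{r(a)} = H(\rho_a) - \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i).
\]

Thus, maximizing the impurity-based objective function with the entropy function as the impurity function is
equivalent to minimizing the cost function $C_a := 1 - H(\rho_a) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i)$.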



8.3 Proof of Theorem H 

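Let $T_a$ denote a subtree from any node 'a' in the tree $T$ and let $\mathcal{L}_a$ denote the set of leaf nodes in this
subtree. Then, let $\mu_a$ denote the expected number of queries required to identify the group of an object
terminating in a leaf node of this subtree, given by

\[
\mu_a = \sum_{j \in \mathcal{L}_a} \frac{\pi_{\Theta_j}}{\pi_{\Theta_a}}\, d_j^a\, p_j^a,
\]

where $d_j^a$ and $p_j^a$ denote the depth of leaf node $j$ in the subtree $T_a$ and the probability of reaching that leaf
node given $\theta \in \Theta_j$, respectively, and let $H_a$ denote the entropy of the probability distribution of the object
groups at the root node of this subtree, i.e.,

\[
H_a = -\sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} \log\frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}}.
\]

Now, we show using induction that for any subtree $T_a$ in the tree $T$, the following relation holds:

\[
\pi_{\Theta_a}\mu_a - \pi_{\Theta_a} H_a = \sum_{s \in \mathcal{I}_a} p_s^a\, \pi_{\Theta_s} \left[ 1 - \sum_{q \in Q^{z_s}} p_{z_s}(q) \left\{ H(\rho_s(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i(q)) \right\} \right],
\]

where $\mathcal{I}_a$ denotes the set of internal nodes in the subtree $T_a$.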

The relation holds trivially for any subtree rooted at a leaf node of the tree $T$, with both the left-hand
side and the right-hand side of the expression being equal to 0. Now, assume the above relation holds
for all subtrees rooted at the child nodes of node 'a'. Note that node 'a' has a set of left and right child
nodes, each set corresponding to one query from the query group selected at that node. Then, using the
decomposition in Lemma 1 on each query from this query group, we have



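\[
\begin{aligned}
\pi_{\Theta_a}\left[\mu_a - H_a\right]
&= \sum_{q \in Q^{z_a}} p_{z_a}(q)\, \pi_{\Theta_a}\left[\mu_a(q) - H_a\right] \\
&= \sum_{q \in Q^{z_a}} p_{z_a}(q) \Big\{ \pi_{\Theta_{l^q(a)}}\left[\mu_{l^q(a)} - H_{l^q(a)}\right] + \pi_{\Theta_{r^q(a)}}\left[\mu_{r^q(a)} - H_{r^q(a)}\right] \\
&\qquad\qquad + \pi_{\Theta_a}\Big[ 1 - H(\rho_a(q)) + \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i(q)) \Big] \Big\} \\
&= \sum_{q \in Q^{z_a}} p_{z_a}(q) \Big\{ \pi_{\Theta_{l^q(a)}}\left[\mu_{l^q(a)} - H_{l^q(a)}\right] + \pi_{\Theta_{r^q(a)}}\left[\mu_{r^q(a)} - H_{r^q(a)}\right] \Big\} \\
&\quad + \pi_{\Theta_a}\Big[ 1 - \sum_{q \in Q^{z_a}} p_{z_a}(q) \Big\{ H(\rho_a(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i(q)) \Big\} \Big],
\end{aligned}
\]

where $\mu_a(q)$ denotes the value of $\mu_a$ when query $q$ is chosen at node 'a', $l^q(a)$ and $r^q(a)$ correspond to the
left and right child of node 'a' when query $q$ is chosen from the query group, and $\mu_{l^q(a)}$, $\pi_{\Theta_{l^q(a)}}$, $H_{l^q(a)}$
correspond to the expected depth of a leaf node in the subtree $T_{l^q(a)}$, the probability mass of the objects at
the root node of this subtree, and the entropy of the probability distribution of the object groups at the root
node of this subtree, respectively. Now, using the induction hypothesis, we get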



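\[
\begin{aligned}
\pi_{\Theta_a}\mu_a - \pi_{\Theta_a} H_a
&= \sum_{q \in Q^{z_a}} p_{z_a}(q) \Bigg\{ \sum_{s \in \mathcal{I}_{l^q(a)}} p_s^{l^q(a)}\, \pi_{\Theta_s} \Big[ 1 - \sum_{q' \in Q^{z_s}} p_{z_s}(q') \Big( H(\rho_s(q')) - \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i(q')) \Big) \Big] \\
&\qquad\qquad + \sum_{s \in \mathcal{I}_{r^q(a)}} p_s^{r^q(a)}\, \pi_{\Theta_s} \Big[ 1 - \sum_{q' \in Q^{z_s}} p_{z_s}(q') \Big( H(\rho_s(q')) - \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i(q')) \Big) \Big] \Bigg\} \\
&\quad + \pi_{\Theta_a}\Big[ 1 - \sum_{q \in Q^{z_a}} p_{z_a}(q) \Big( H(\rho_a(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_a^i}}{\pi_{\Theta_a}} H(\rho_a^i(q)) \Big) \Big] \\
&= \sum_{s \in \mathcal{I}_a} p_s^a\, \pi_{\Theta_s} \Big[ 1 - \sum_{q \in Q^{z_s}} p_{z_s}(q) \Big( H(\rho_s(q)) - \sum_{i=1}^{m} \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i(q)) \Big) \Big],
\end{aligned}
\]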



8=1 



thereby completing the induction. Finally, the result follows by applying the relation to the subtree rooted
at the root node of $T$, whose probability mass is $\pi_{\Theta_a} = 1$.
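The per-node term in this relation has a simple reading: if the selection probabilities $p_{z_s}(\cdot)$ sum to one over the query group (an assumption made here), the bracketed quantity is the average, over the user's choice of query, of the single-query cost $1 - H(\rho_s(q)) + \sum_i \frac{\pi_{\Theta_s^i}}{\pi_{\Theta_s}} H(\rho_s^i(q))$. The sketch below compares two hypothetical query groups on that basis; all names, probabilities and masses are invented for illustration, and the reduction factors are assumed to be the max-over-children ratios.

```python
import math


def binary_entropy(rho):
    """Binary entropy H(rho) in bits, with H(0) = H(1) = 0."""
    if rho <= 0.0 or rho >= 1.0:
        return 0.0
    return -rho * math.log2(rho) - (1.0 - rho) * math.log2(1.0 - rho)


def cost(left, right):
    """Single-query cost 1 - H(rho) + sum_i (pi^i / pi) H(rho^i)."""
    node = [l + r for l, r in zip(left, right)]
    pi_a = sum(node)
    rho_a = max(sum(left), sum(right)) / pi_a
    weighted = sum((n / pi_a) * binary_entropy(max(l, r) / n)
                   for l, r, n in zip(left, right, node) if n > 0)
    return 1.0 - binary_entropy(rho_a) + weighted


def expected_group_cost(queries):
    """Average single-query cost under the user's selection probabilities.

    queries: list of (selection_probability, left_masses, right_masses);
    the selection probabilities are assumed to sum to one.
    """
    return sum(p * cost(left, right) for p, left, right in queries)


# Two hypothetical query groups at a node holding 3 object groups.
query_groups = {
    "G1": [(0.5, [0.30, 0.00, 0.00], [0.00, 0.40, 0.30]),
           (0.5, [0.15, 0.20, 0.15], [0.15, 0.20, 0.15])],
    "G2": [(1.0, [0.30, 0.20, 0.00], [0.00, 0.20, 0.30])],
}
best_group = min(query_groups, key=lambda g: expected_group_cost(query_groups[g]))
print(best_group)
```

In this toy case the group containing the single best query (G1) is not the best group, because the user may equally pick its uninformative query.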



Appendix II 

Reduction factor calculation in the persistent noise model 

At any internal node $a \in \mathcal{I}$ in a tree, let $\delta_i^a$ denote the Hamming distance between the query responses
up to this internal node ($Q_a$) and the true responses of object $\theta_i$ to those queries. Also, let $n_a$ denote
the number of queries from the set of $N\nu$ queries (that were prone to error) in the set $Q \setminus Q_a$ and, for
a query $q \in Q \setminus Q_a$, denote by $b_i(q)$ the binary response of object $\theta_i$ to that query. Denote by
$I_a = \{i : \delta_i^a \le \epsilon'\}$ the set of object groups with a non-zero number of objects at this internal node. All the
formulas below come from routine calculations based on probability model 2.

For a query $q \in Q \setminus Q_a$ that is not prone to error, the reduction factor and the group reduction factors
generated by choosing that query at node 'a' are as follows. The group reduction factor of any group $i \in I_a$
is equal to 1 and the reduction factor is given by

\[
\rho_a = \frac{\max\left\{ \displaystyle\sum_{i \in I_a^0} \pi_i \sum_{e=0}^{r_i^a} \binom{n_a}{e} p^{e+\delta_i^a} (1-p)^{N\nu - e - \delta_i^a},\;
 \sum_{i \in I_a^1} \pi_i \sum_{e=0}^{r_i^a} \binom{n_a}{e} p^{e+\delta_i^a} (1-p)^{N\nu - e - \delta_i^a} \right\}}
 {\displaystyle\sum_{i \in I_a} \pi_i \sum_{e=0}^{r_i^a} \binom{n_a}{e} p^{e+\delta_i^a} (1-p)^{N\nu - e - \delta_i^a}},
\]

where $I_a^0 = \{i \in I_a : b_i(q) = 0\}$, $I_a^1 = \{i \in I_a : b_i(q) = 1\}$ and $r_i^a = \min(n_a, \epsilon' - \delta_i^a)$.
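As a rough illustration of this calculation, the sketch below computes the reduction factor of an error-free query. It assumes that the unnormalized mass of group $i$ at node 'a' is $\pi_i \sum_{e=0}^{r_i^a} \binom{n_a}{e} p^{e+\delta_i^a}(1-p)^{N\nu-e-\delta_i^a}$, with the common normalizing constant of the noise model cancelling in the ratio. The function and argument names (for example `n_noisy` for $N\nu$ and `eps` for $\epsilon'$) are hypothetical.

```python
from math import comb


def group_mass(prior, delta, n_a, n_noisy, eps, p):
    """Unnormalized mass of one object group at node a.

    prior   : prior probability pi_i of the object
    delta   : Hamming distance delta_i^a accumulated so far
    n_a     : error-prone queries still unasked at node a
    n_noisy : total number of error-prone queries (N nu in the text)
    eps     : maximum number of errors allowed (epsilon')
    p       : error probability of an error-prone query
    """
    r = min(n_a, eps - delta)
    return prior * sum(comb(n_a, e) * p ** (e + delta) * (1 - p) ** (n_noisy - e - delta)
                       for e in range(r + 1))


def reduction_factor(priors, deltas, responses, n_a, n_noisy, eps, p):
    """Reduction factor of an error-free query with true responses b_i(q)."""
    mass = [0.0, 0.0]  # total mass sent to the 0-branch and the 1-branch
    for prior, delta, b in zip(priors, deltas, responses):
        if delta > eps:
            continue  # group i is not in I_a
        mass[b] += group_mass(prior, delta, n_a, n_noisy, eps, p)
    return max(mass) / sum(mass)


# Hypothetical node: two equally likely objects, two error-prone queries
# already asked, at most one error allowed, error probability 0.1.
rho = reduction_factor(priors=[0.5, 0.5], deltas=[0, 1], responses=[0, 1],
                       n_a=0, n_noisy=2, eps=1, p=0.1)
print(rho)
```

Because an error-free query sends each group entirely to one child, only the total mass on each side matters, which is why every group reduction factor equals 1 in this case.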

In addition, for a query $q \in Q \setminus Q_a$ that is prone to error, denote by $\delta_i^{l(a)}$ and $\delta_i^{r(a)}$ the Hamming distance
between the user responses to queries up to the left and right child node of node 'a', with query $q$ chosen
at node 'a', and the true responses of object $\theta_i$ to those queries. In particular, $\delta_i^{l(a)} = \delta_i^a + |b_i(q) - 0|$ and
$\delta_i^{r(a)} = \delta_i^a + |b_i(q) - 1|$. Then, the reduction factor and the group reduction factors generated by choosing
this query at node 'a' are as follows. The group reduction factor of a group $i \in I_a$ whose $\delta_i^a = \epsilon'$ is equal
to 1 and that of a group whose $\delta_i^a < \epsilon'$ is given by

\[
\rho_a^i = \frac{\max\left\{ \displaystyle\sum_{e=0}^{r_i^{l(a)}} \binom{n_a - 1}{e} p^{e+\delta_i^{l(a)}} (1-p)^{N\nu - e - \delta_i^{l(a)}},\;
 \sum_{e=0}^{r_i^{r(a)}} \binom{n_a - 1}{e} p^{e+\delta_i^{r(a)}} (1-p)^{N\nu - e - \delta_i^{r(a)}} \right\}}
 {\displaystyle\sum_{e=0}^{r_i^a} \binom{n_a}{e} p^{e+\delta_i^a} (1-p)^{N\nu - e - \delta_i^a}},
\]

where $r_i^{l(a)} = \min(n_a - 1, \epsilon' - \delta_i^{l(a)})$ and $r_i^{r(a)} = \min(n_a - 1, \epsilon' - \delta_i^{r(a)})$, and the reduction factor is given by